An LLM-based application that communicates with a DB to extract information.
The application takes user input (a natural-language question or inquiry about the DB) > [converts it to a SQL query > queries the DB > returns the result to the LLM] (iterating until the LLM is satisfied) > returns the final output to the user
To view this Streamlit app on a browser, run it with the following command:
streamlit run PATH\main.py [ARGUMENTS]
- Note: This code connects to the DB using Postgres. To connect to a different DB, replace the DB type and adapter to match your SQL engine of choice (in the `pg_uri` variable in the 'helper_funcs.py' module; see the sketch below). The rest of the code should work just fine 👍🏼.
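A minimal sketch of what the `pg_uri` assignment might look like for a PostgreSQL engine (the psycopg2 adapter is just one choice; variable names other than the DB_* environment variables are assumptions):

```python
# Hedged sketch: building the connection URI for PostgreSQL with the psycopg2 adapter.
# For a different engine, swap the dialect/driver prefix (e.g. "mysql+pymysql") and
# install the matching driver. Assumes the DB_* variables from the .env file described
# below are already loaded into the environment.
import os

pg_uri = (
    f"postgresql+psycopg2://{os.environ['DB_USER']}:{os.environ['DB_PASSWORD']}"
    f"@{os.environ['DB_HOST']}:{os.environ['DB_PORT']}/{os.environ['DB_NAME']}"
)
```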
Additional notes:
Make sure to create a `.env` file for storing your OpenAI API key and DB-connection variables.
Use the following format:
OPENAI_API_KEY = "your_openai_api_key"
DB_USER = "your_db_user"
DB_PASSWORD = "your_db_password"
DB_HOST = "your_db_host"
DB_PORT = your_db_port  # an int, e.g. 5432
DB_NAME = "your_db_name"
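A hedged sketch of how these values can be loaded at startup, assuming the python-dotenv package (the repo may load them differently):

```python
# Load the .env file into the process environment so the connection code can read it.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory

openai_api_key = os.getenv("OPENAI_API_KEY")
db_port = int(os.getenv("DB_PORT", "5432"))  # DB_PORT should be an int; 5432 is the Postgres default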
LangChain 🦜🔗 (a framework to develop applications on LLMs):
- Actively developing SQL Agent capabilities
- Seamless connection with OpenAI
- Open-source
- Under constant development
- Well-documented
- Extensive usage
- Extensive community support
- Reliable
- Text-to-SQL query vs. Text-to-SQL query and execution vs. SQL Agent (see the sketch below)
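A rough sketch of how the three options compare in code, assuming LangChain's SQL utilities and an OpenAI chat model (import paths vary across LangChain versions; the URI and questions are placeholders):

```python
from langchain.chat_models import ChatOpenAI
from langchain.sql_database import SQLDatabase
from langchain.chains import create_sql_query_chain
from langchain.agents import create_sql_agent
from langchain.agents.agent_toolkits import SQLDatabaseToolkit

db = SQLDatabase.from_uri("postgresql+psycopg2://user:password@localhost:5432/imdb")
llm = ChatOpenAI(model="gpt-3.5-turbo-1106", temperature=0)  # needs OPENAI_API_KEY in the env

# 1) Text-to-SQL query: the chain only writes the SQL; you execute it yourself.
query_chain = create_sql_query_chain(llm, db)
sql = query_chain.invoke({"question": "How many titles were released in 1994?"})

# 2) Text-to-SQL query and execution: run the generated SQL against the DB.
result = db.run(sql)

# 3) SQL Agent: iterates over schema inspection, query writing, execution, and
#    error recovery until it is satisfied with the answer.
agent = create_sql_agent(llm=llm, toolkit=SQLDatabaseToolkit(db=db, llm=llm), verbose=True)
answer = agent.run("Which actor appears in the most titles?")
```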
- Open-source
- Great documentation
- Community support
- Customization
- Pure Python
- Seamless integrations
- Deployment
- LLM and ML support
- A good alternative that shares most of Streamlit's pros.
- A suitable substitute.
- I personally like Streamlit, and since it was recently acquired by Snowflake, I believe in its potential to grow much faster. Still, both are great options.
- Used: model='gpt-3.5-turbo-1106'
- Seamless connection with LangChain
- Best-performing model on the market
- Regularly updated, offering continuous improvements
- Hallucinations
Alternatives: Open-source LLMs 🦙 (pros & cons):
- A plethora of fine-tuned open-source models for SQL generation
- Much more cost-effective
- Docker
- Hallucinations
- Incorporate a combined approach: use an open-source LLM fine-tuned for SQL generation alongside LangChain's SQL query execution (see the sketch below).
- This is a valid choice, yet it means waiving the main advantages of the SQL Agent, namely recovering from errors and answering questions based on the database's schema as well as its content.
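A hedged sketch of that combined approach, assuming a Hugging Face `transformers` text-generation pipeline for the fine-tuned model (the model name is a placeholder, and the prompt format depends on the model you pick):

```python
# Open-source LLM generates the SQL; LangChain's SQLDatabase only executes it.
# Note what is lost vs. the SQL Agent: no error recovery and no iterative schema reasoning.
from transformers import pipeline
from langchain.sql_database import SQLDatabase

db = SQLDatabase.from_uri("postgresql+psycopg2://user:password@localhost:5432/imdb")
generator = pipeline("text-generation", model="your-org/your-sql-finetuned-model")  # placeholder

prompt = (
    "### Task\nTranslate the question into a PostgreSQL query.\n"
    f"### Schema\n{db.get_table_info()}\n"
    "### Question\nHow many titles were released in 1994?\n"
    "### SQL\n"
)
generated = generator(prompt, max_new_tokens=128)[0]["generated_text"]
sql = generated[len(prompt):]  # strip the echoed prompt, keep only the SQL
print(db.run(sql))             # plain execution, no agent-style deliberation
```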
- Open-source and free
- Good support and documentation
- Reliable
- Supports a wide range of extensions
- Seamlessly integrates with Python
- Download the non-commercial IMDB dataset from IMDB datasets
- See IMDB Non-commercial datasets for file description
- Unzip files (`.tsv.gz` files)
- Preprocess files before uploading to the Database (see the sketch below):
  - Handling null values correctly
  - Handling mixed data types in the same column
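A minimal preprocessing sketch with pandas: the IMDB files use `\N` for missing values, and reading everything as `str` first avoids mixed-dtype columns (file and column names follow the IMDB file description; the rest is illustrative):

```python
import pandas as pd

df = pd.read_csv(
    "title.basics.tsv",
    sep="\t",
    na_values="\\N",        # IMDB marks nulls as \N -> parsed as NaN
    keep_default_na=False,  # don't let pandas treat strings like "NA" as missing
    dtype=str,              # load as str first to avoid mixed types within a column
    quoting=3,              # csv.QUOTE_NONE: titles may contain stray quote characters
)

# Cast numeric columns explicitly; nullable Int64 keeps the nulls intact.
for col in ["startYear", "endYear", "runtimeMinutes"]:
    df[col] = pd.to_numeric(df[col], errors="coerce").astype("Int64")

df.to_csv("title.basics.clean.tsv", sep="\t", index=False)
```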
- Choose an appropriate SQL engine
- Set up a new database
- Create tables and columns that correspond to the TSV files
- Column data types: when deciding on dtypes, take performance and storage considerations into account (a DDL sketch follows below):
  - `str` - check `df['col'].str.len().max()` to decide on `n` for `VARCHAR(n)`
  - `int` - `INTEGER` vs `SMALLINT`; `SMALLINT` is relevant for: `title.basics.tsv` (`startYear` and `endYear` columns) and `name.basics.tsv` (`birthYear` and `deathYear` columns)
  - `float` - `DOUBLE PRECISION`
  - `bool` - `BOOLEAN`
  - `List[str]` - `VARCHAR(n)[]`, unless it behaves like `VARCHAR(n)`; relevant for: `title.akas.tsv` (`types` and `attributes` columns) and `title.basics.tsv` (`genres` column)
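As an illustration of these choices, a hedged DDL sketch for `title.basics.tsv` executed via psycopg2 (the `VARCHAR` lengths, connection parameters, and column subset are assumptions to be adapted to your own measurements):

```python
import psycopg2

ddl = """
CREATE TABLE IF NOT EXISTS title_basics (
    tconst          VARCHAR(12) PRIMARY KEY,
    titleType       VARCHAR(20),
    primaryTitle    VARCHAR(500),
    isAdult         BOOLEAN,
    startYear       SMALLINT,
    endYear         SMALLINT,
    runtimeMinutes  INTEGER,
    genres          VARCHAR(32)[]   -- List[str] column stored as a Postgres array
);
"""

conn = psycopg2.connect(dbname="imdb", user="user", password="password",
                        host="localhost", port=5432)
with conn, conn.cursor() as cur:  # commits on success, rolls back on error
    cur.execute(ddl)
conn.close()
```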
- Import the preprocessed TSV files into the newly created tables:
  - Exceptionally large files should be broken into smaller files and uploaded sequentially in batches (see the sketch below).
  - Write errors to file and inspect `error_file.txt` to further handle any problematic records.
  - Insert inconvertible values as null (mark the tick-box in the 'Data Import' dialogue box).
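A hedged sketch of batched uploads for a very large TSV, assuming pandas + SQLAlchemy and that the chunks were already preprocessed to match the table schema; failed batches are logged to `error_file.txt` for later inspection (URI, file, and table names are placeholders):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/imdb")

with open("error_file.txt", "a") as err_log:
    chunks = pd.read_csv("title.principals.tsv", sep="\t", na_values="\\N",
                         keep_default_na=False, quoting=3, chunksize=100_000)
    for i, chunk in enumerate(chunks):
        try:
            chunk.to_sql("title_principals", engine, if_exists="append", index=False)
        except Exception as exc:  # keep going; record the failing batch for manual handling
            err_log.write(f"batch {i}: {exc}\n")
```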
- Problem: LangChain-OpenAI incompatibility issues due to OpenAI's `ChatCompletion` endpoint change not yet being integrated in LangChain.
  - Solution: Downgrade to `openai==0.28` (versions 1.0.0+ will not work until LangChain is updated accordingly).
- Issue: LangChain's `return_intermediate_steps` param isn't yet applicable for LangChain SQL Agents.
  - Solution: Consider using a Text-to-SQL query (and execution) chain instead to access intermediate steps (see the sketch below).
- Considerations: LangChain's SQL Agent provides a more flexible and advanced way of interacting with SQL databases (see Part I above - Frameworks).
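A hedged sketch of that workaround: the Text-to-SQL query-and-execution chain exposes `return_intermediate_steps` (the import path for `SQLDatabaseChain` differs across LangChain versions; this assumes `langchain_experimental`):

```python
from langchain.chat_models import ChatOpenAI
from langchain.sql_database import SQLDatabase
from langchain_experimental.sql import SQLDatabaseChain

db = SQLDatabase.from_uri("postgresql+psycopg2://user:password@localhost:5432/imdb")
llm = ChatOpenAI(model="gpt-3.5-turbo-1106", temperature=0)

chain = SQLDatabaseChain.from_llm(llm, db, return_intermediate_steps=True)
result = chain("How many titles were released in 1994?")

print(result["intermediate_steps"])  # the generated SQL and the raw DB output
print(result["result"])              # the final natural-language answer
```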
- OpenAI `temperature` param - set to 0 to block model creativity when translating free-text client questions into SQL queries.
User prompt > LangChain employs a SQL Agent to handle the back-and-forth deliberations between the LLM and the DB (using the SQL toolkit by proxy), which queries and returns responses from the DB > until the LLM is satisfied with the result > return the response to the client (see the sketch below)
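A minimal sketch of how this flow might look in the Streamlit entry point (this is not the repo's actual `main.py`; names and the hard-coded URI are placeholders):

```python
import streamlit as st
from langchain.chat_models import ChatOpenAI
from langchain.agents import create_sql_agent
from langchain.agents.agent_toolkits import SQLDatabaseToolkit
from langchain.sql_database import SQLDatabase

st.title("Chat with the IMDB database 🎬")

db = SQLDatabase.from_uri("postgresql+psycopg2://user:password@localhost:5432/imdb")
llm = ChatOpenAI(model="gpt-3.5-turbo-1106", temperature=0)  # temperature=0: no creative SQL
agent = create_sql_agent(llm=llm, toolkit=SQLDatabaseToolkit(db=db, llm=llm), verbose=True)

question = st.text_input("Ask a question about the database:")
if question:
    with st.spinner("Deliberating with the database..."):
        # The agent loops: inspect schema -> draft SQL -> execute -> recover from errors,
        # until it is satisfied with the result.
        answer = agent.run(question)
    st.write(answer)
```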
- Set primary and foreign keys (Database)
- Improve the prompt to the LLM with few-shot examples of natural-language requests and their corresponding SQL queries (LangChain) - see the sketch below.
- Improve LLM retrieval with vector DB storage and similarity comparison to other relevant SQL queries.
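A hedged sketch of the few-shot idea, assuming the `prefix` argument of `create_sql_agent` carries the examples (the example questions and queries below are illustrative, not taken from the repo):

```python
from langchain.chat_models import ChatOpenAI
from langchain.agents import create_sql_agent
from langchain.agents.agent_toolkits import SQLDatabaseToolkit
from langchain.sql_database import SQLDatabase

FEW_SHOT_PREFIX = """You are an agent designed to interact with a PostgreSQL database.
Here are examples of questions and correct queries:

Question: How many titles were released in 1994?
SQL: SELECT COUNT(*) FROM title_basics WHERE startYear = 1994;

Question: List 5 distinct genres.
SQL: SELECT DISTINCT unnest(genres) FROM title_basics LIMIT 5;
"""

db = SQLDatabase.from_uri("postgresql+psycopg2://user:password@localhost:5432/imdb")
llm = ChatOpenAI(model="gpt-3.5-turbo-1106", temperature=0)
agent = create_sql_agent(
    llm=llm,
    toolkit=SQLDatabaseToolkit(db=db, llm=llm),
    prefix=FEW_SHOT_PREFIX,  # replaces the default system prefix with one that includes examples
    verbose=True,
)
```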
- Export the DB to a compressed file
  - Bash command:
    `pg_dump -U [username] -h [hostname] -d [databasename] | gzip > [full_path_of_output_file.sql.gz]`
- Docker
- Set params in `secret.py` to environment variables
- Integrate a chatbot to interact with the user, accept secret params, and set them accordingly in the virtual environment to establish a live connection to the database.





