This repository offers a comprehensive setup for implementing a hybrid search system that emphasizes recency in search results. In many real-world scenarios, retrieving the most recent information is crucial, especially in dynamic domains like news, social media, or stock market data. This system is designed to balance relevance and freshness, ensuring that users receive the latest and most pertinent results.
The solution uses PostgreSQL's pgvector extension, which stores and queries vector data efficiently within the database. Because vector search runs directly inside PostgreSQL, embeddings can be stored and queried alongside the rest of the data using ordinary SQL.
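One way to picture the balance between relevance and freshness is a score that blends semantic similarity with an exponential recency decay. The sketch below is illustrative only: the function names, the `alpha` weight, and the seven-day half-life are assumptions, not values taken from this repository.

```python
from datetime import datetime, timedelta, timezone


def recency_weight(published_at, now, half_life_days=7.0):
    """Exponential decay: 1.0 for a brand-new document, 0.5 after one half-life."""
    age_days = max((now - published_at).total_seconds() / 86400.0, 0.0)
    return 0.5 ** (age_days / half_life_days)


def blended_score(similarity, published_at, now, alpha=0.7, half_life_days=7.0):
    """Weighted sum of similarity and recency; both inputs are expected in [0, 1]."""
    recency = recency_weight(published_at, now, half_life_days)
    return alpha * similarity + (1.0 - alpha) * recency


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # A slightly less similar but much fresher document can outrank an older one.
    print(blended_score(0.90, now - timedelta(days=30), now))
    print(blended_score(0.85, now - timedelta(days=1), now))
```

Raising `alpha` favors relevance; shortening `half_life_days` makes older documents fade faster.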
Before you begin, make sure the following are in place:
- Docker: To run the containers.
- A .env file: Copy .env.example to .env.
- Embedding access: AWS Bedrock or an OpenAI API key for embedding generation.
- Note: To use the OpenAI API for embedding generation, copy your API key into the .env file.
Start by spinning up the necessary containers. This will initialize a PostgreSQL database, a worker instance for running commands, and pgweb for visualizing data and running queries.
docker compose up -d
To enable vector operations in PostgreSQL, you need to create the vector extension. You have two options:
- Option 1: Run the following command to create the extension directly in the container.
docker exec -it pg_container psql -U postgres -d vectordb -c "CREATE EXTENSION IF NOT EXISTS vector;"
- Option 2: Use pgweb to run the extension query. Navigate to http://localhost:8002/ and execute:
CREATE EXTENSION IF NOT EXISTS vector;
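To confirm that the extension was created, you can query the `pg_extension` catalog. The following is a minimal sketch that wraps the same `docker exec ... psql` invocation used above; the helper names `psql_command` and `vector_extension_version` are made up for this example, while the container, user, and database names come from the commands in this guide.

```python
import subprocess


def psql_command(sql, container="pg_container", user="postgres", db="vectordb"):
    """Build a docker exec command; -t prints tuples only, -A disables alignment."""
    return ["docker", "exec", container, "psql", "-U", user, "-d", db,
            "-t", "-A", "-c", sql]


def vector_extension_version(run=subprocess.run):
    """Return the installed pgvector version string, or None if it is missing."""
    sql = "SELECT extversion FROM pg_extension WHERE extname = 'vector';"
    result = run(psql_command(sql), capture_output=True, text=True, check=True)
    return result.stdout.strip() or None


if __name__ == "__main__":
    version = vector_extension_version()
    print(f"pgvector version: {version}" if version else "vector extension not found")
```

The `run` parameter exists only so the helper can be exercised without a running container.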
Ingest your data into the PostgreSQL database using the provided Python script.
docker exec -t worker bash -c "python ./search/ingest-to-pg.py -r prod"
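The ingestion script itself is not reproduced here. As a rough idea of what writing an embedding to a pgvector column involves, the sketch below renders a list of floats in pgvector's text input format (`[x1,x2,...]`) and pairs it with a parameterized insert. The table name `documents`, its columns, and the helper names are hypothetical and may not match what `ingest-to-pg.py` does.

```python
def to_pgvector_literal(embedding):
    """Render a sequence of numbers in pgvector's text format, e.g. [0.1,0.2,0.3]."""
    return "[" + ",".join(str(float(x)) for x in embedding) + "]"


# Hypothetical table and column names, for illustration only.
INSERT_SQL = (
    "INSERT INTO documents (content, published_at, embedding) "
    "VALUES (%s, %s, %s::vector)"
)


def build_insert_params(content, published_at, embedding):
    """Return the parameter tuple matching the placeholders in INSERT_SQL."""
    return (content, published_at, to_pgvector_literal(embedding))


if __name__ == "__main__":
    print(INSERT_SQL)
    print(build_insert_params("Example headline", "2024-01-01", [0.25, -0.5, 1]))
```

Keeping a timestamp column next to the embedding is what would let a query weigh recency as well as similarity.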
First, count the functions in the public namespace to record a baseline. The count should increase after each SQL file below is applied.
docker exec -t pg_container psql -U postgres -d vectordb -c "SELECT COUNT(*) FROM pg_proc WHERE pronamespace = 'public'::regnamespace;"
Copy the SQL files for hybrid and combo retrieval functions into the PostgreSQL container.
docker cp ./search/hybrid-retrieval.sql pg_container:/home/hybrid-retrieval.sql && docker cp ./search/combo-retrieval.sql pg_container:/home/combo-retrieval.sql
Run the SQL files to add the necessary functions for hybrid and combo retrieval.
Hybrid retrieval function
docker exec -t pg_container psql -U postgres -d vectordb -f /home/hybrid-retrieval.sql
Now count the functions in the public namespace again to verify that the hybrid retrieval function was added.
docker exec -t pg_container psql -U postgres -d vectordb -c "SELECT COUNT(*) FROM pg_proc WHERE pronamespace = 'public'::regnamespace;"
Combo retrieval function
docker exec -t pg_container psql -U postgres -d vectordb -f /home/combo-retrieval.sql
Optional: Count the functions in the public namespace once more to verify that the combo retrieval function was added.
docker exec -t pg_container psql -U postgres -d vectordb -c "SELECT COUNT(*) FROM pg_proc WHERE pronamespace = 'public'::regnamespace;"
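A bare count shows that something changed but not what. If you would rather see which functions were added, one option is to capture the function names before and after and diff them. The sketch below assumes the output was captured with `psql -t -A -c "SELECT proname FROM pg_proc WHERE pronamespace = 'public'::regnamespace;"`; the helper names and the `hybrid_search` name in the demo are hypothetical.

```python
def parse_names(psql_output):
    """Turn unaligned, tuples-only psql output into a set of function names."""
    return {line.strip() for line in psql_output.splitlines() if line.strip()}


def added_functions(before_output, after_output):
    """Return the names present after a change but not before, sorted."""
    return sorted(parse_names(after_output) - parse_names(before_output))


if __name__ == "__main__":
    before = "l2_distance\ncosine_distance\n"
    # 'hybrid_search' is a made-up name standing in for a newly created function.
    after = "l2_distance\ncosine_distance\nhybrid_search\n"
    print(added_functions(before, after))
```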
Execute the search script to perform queries.
docker exec -t worker bash -c "python ./search/search.py -c"
docker exec -t worker bash -c "python ./search/search.py -h"
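How the hybrid and combo functions merge their result lists is defined in the SQL files and is not reproduced here. As a general illustration of hybrid search, reciprocal rank fusion (RRF) is one common way to combine a keyword ranking with a vector ranking. The sketch below is not necessarily the method this repository uses; the function name and the constant `k=60` are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranked list.

    Each document earns 1 / (k + rank) from every list it appears in,
    where rank starts at 1. Ties are broken by document id.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    keyword_hits = ["a", "b", "c"]  # e.g. from full-text search
    vector_hits = ["b", "c", "d"]   # e.g. from nearest-neighbor search
    print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
```

Documents that appear in both lists rise to the top, which is the behavior a hybrid search generally aims for.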
Remove all Docker containers, along with the data ingested into the PostgreSQL database:
docker compose down --volumes