1 change: 1 addition & 0 deletions .gitignore
@@ -0,0 +1 @@
.streamlit/secrets.toml
231 changes: 103 additions & 128 deletions README.md
@@ -1,16 +1,24 @@
## RAG Demo using Couchbase, Streamlit, Haystack, and OpenAI

This is a demo app built to chat with your custom PDFs using the vector search capabilities of Couchbase to augment the OpenAI results in a Retrieval-Augmented-Generation (RAG) model.
This is a demo app built to chat with your custom PDFs using the **Couchbase Vector Index** to augment the OpenAI results in a Retrieval-Augmented-Generation (RAG) model.

This demo uses **`CouchbaseQueryDocumentStore`** with GSI vector indexes, which offers:

- **High-performance vector search at massive scale** (billions of documents)
- **BHIVe (Hyperscale Vector Index)** support for pure vector search
- **Composite Vector Index** support for filtered vector search
- **SQL++ queries** for efficient vector retrieval
- **Low memory footprint** and concurrent updates & searches

### How does it work?

You can upload your PDFs with custom data and ask questions about that data in the chat box.
For each question, you will get two answers:

- one using RAG (Couchbase logo)
- one using pure LLM - Gemini Pro (🤖).
- one using pure LLM - OpenAI (🤖).

The RAG pipeline utilizes Haystack, Couchbase Vector Search, and a OpenAI model. It fetches relevant parts of the PDF using vector search and adds them as context for the language model.
The RAG pipeline utilizes Haystack, the Couchbase GSI Vector Index (Hyperscale or Composite), and OpenAI models. It fetches relevant parts of the PDF using vector search and adds them as context for the language model.
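
Before any question can be answered, the uploaded PDF has to be converted, chunked, embedded, and written to Couchbase. Below is a minimal sketch of that indexing pipeline, assuming standard Haystack 2.x wiring: the converter, cleaner, splitter, and writer match the imports in `chat_with_pdf.py`, while `OpenAIDocumentEmbedder` and the `document_store` variable are assumptions for illustration.

```python
from haystack import Pipeline
from haystack.components.converters import PyPDFToDocument
from haystack.components.embedders import OpenAIDocumentEmbedder
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter

# Assumes document_store is the CouchbaseQueryDocumentStore configured
# in chat_with_pdf.py (see "Implementation Details" below).
indexing_pipeline = Pipeline()
indexing_pipeline.add_component("converter", PyPDFToDocument())        # PDF -> Documents
indexing_pipeline.add_component("cleaner", DocumentCleaner())          # strip noise and extra whitespace
indexing_pipeline.add_component("splitter", DocumentSplitter())        # chunk documents for retrieval
indexing_pipeline.add_component("embedder", OpenAIDocumentEmbedder())  # 1536-dim OpenAI embeddings
indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store))

indexing_pipeline.connect("converter", "cleaner")
indexing_pipeline.connect("cleaner", "splitter")
indexing_pipeline.connect("splitter", "embedder")
indexing_pipeline.connect("embedder", "writer")

indexing_pipeline.run({"converter": {"sources": ["your_file.pdf"]}})
```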


### Setup and Installation
@@ -30,147 +38,114 @@ The RAG pipeline utilizes Haystack, Couchbase Vector Search, and a OpenAI model.
DB_BUCKET = "<bucket_name>"
DB_SCOPE = "<scope_name>"
DB_COLLECTION = "<collection_name>"
INDEX_NAME = "<vector_capable_fts_index_name>"
OPENAI_API_KEY = "<openai_api_key>"
```

- #### Create the Search Index on Full Text Service
- #### Create the Vector Index

We need to create the Search Index on the Full Text Service in Couchbase. For this demo, you can import the following index using the instructions.
This demo uses Couchbase's new Vector Indexes (introduced in version 8.0), which give you two options:

- [Couchbase Capella](https://docs.couchbase.com/cloud/search/import-search-index.html)
**Option 1: Hyperscale Vector Index** - Recommended in general for its scalability; this is the index used in this demo

BHIVe is optimized for pure vector search at scale. It's perfect for chatbots, RAG applications, and scenarios where you need fast vector similarity search on large datasets.

- Copy the index definition to a new file index.json
- Import the file in Capella using the instructions in the documentation.
- Click on Create Index to create the index.
**Option 2: Composite Vector Index**

Composite indexes combine vector fields with other scalar fields, allowing you to apply filters before vector search. This is useful when you need to narrow down results based on metadata (e.g., date, category, user_id) before performing vector similarity search.

- [Couchbase Server](https://docs.couchbase.com/server/current/search/import-search-index.html)

- Click on Search -> Add Index -> Import
- Copy the following Index definition in the Import screen
- Click on Create Index to create the index.
Learn more about these two vector indexes and when to use one over the other [here](https://docs.couchbase.com/cloud/vector-index/use-vector-indexes.html).

- #### Key Components

- Streamlit: Provides the web interface
- Haystack: Orchestrates the RAG pipeline
- Couchbase: Serves as the vector store
- OpenAI: Supplies the language model

#### Index Definition

Here, we are creating the index `pdf_search` on the documents in the `haystack_collection` collection within the `haystack_scope` scope in the bucket `haystack_bucket`. The Vector field is set to `embeddings` with 1536 dimensions and the text field set to `text`. We are also indexing and storing all the fields under `metadata` in the document as a dynamic mapping to account for varying document structures. The similarity metric is set to `dot_product`. If there is a change in these parameters, please adapt the index accordingly.

```json
{
  "name": "pdf_search",
  "type": "fulltext-index",
  "sourceType": "gocbcore",
  "sourceName": "haystack_bucket",
  "planParams": {
    "indexPartitions": 1,
    "numReplicas": 0
  },
  "params": {
    "doc_config": {
      "docid_prefix_delim": "",
      "docid_regexp": "",
      "mode": "scope.collection.type_field",
      "type_field": "type"
    },
    "mapping": {
      "default_analyzer": "standard",
      "default_datetime_parser": "dateTimeOptional",
      "index_dynamic": true,
      "store_dynamic": true,
      "default_mapping": {
        "dynamic": true,
        "enabled": false
      },
      "types": {
        "haystack_scope.haystack_collection": {
          "dynamic": false,
          "enabled": true,
          "properties": {
            "content": {
              "enabled": true,
              "fields": [
                {
                  "docvalues": true,
                  "include_in_all": false,
                  "include_term_vectors": false,
                  "index": true,
                  "name": "content",
                  "store": true,
                  "type": "text"
                }
              ]
            },
            "embedding": {
              "enabled": true,
              "dynamic": false,
              "fields": [
                {
                  "vector_index_optimized_for": "recall",
                  "docvalues": true,
                  "dims": 1536,
                  "include_in_all": false,
                  "include_term_vectors": false,
                  "index": true,
                  "name": "embedding",
                  "similarity": "dot_product",
                  "store": true,
                  "type": "vector"
                }
              ]
            },
            "dataframe": {
              "enabled": true,
              "fields": [
                {
                  "docvalues": true,
                  "include_in_all": false,
                  "include_term_vectors": false,
                  "index": true,
                  "name": "dataframe",
                  "store": true,
                  "analyzer": "keyword",
                  "type": "text"
                }
              ]
            },
            "meta": {
              "dynamic": true,
              "enabled": true,
              "properties": {
                "name": {
                  "enabled": true,
                  "fields": [
                    {
                      "docvalues": true,
                      "include_in_all": false,
                      "include_term_vectors": false,
                      "index": true,
                      "name": "name",
                      "store": true,
                      "analyzer": "keyword",
                      "type": "text"
                    }
                  ]
                }
              }
            }
          }
        }
      }
    }
  }
}
```
- Couchbase: Serves as the high-performance vector store
- OpenAI: Supplies embeddings and the language model

### Vector Index Creation

You need to create a vector index on your collection **after** loading some documents (the index needs data for training). Choose between a Hyperscale (BHIVe) or a Composite index based on your use case. Whichever vector index you choose won't affect the functionality of this demo, though performance may differ.

#### Option 1: Hyperscale Vector Index - Recommended

Hyperscale is a dedicated vector index optimized for pure vector search at massive scale. Use it for the best performance in RAG applications. Refer [here](https://docs.couchbase.com/cloud/vector-index/hyperscale-vector-index.html) for detailed instructions.

**Creating a Hyperscale Index using SQL++:**

You can create the index using the Couchbase Query Workbench or programmatically:

```sql
CREATE VECTOR INDEX idx_pdf_hyperscale
ON `bucket_name`.`scope_name`.`collection_name`(embedding VECTOR)
WITH {
"dimension": 1536,
"similarity": "DOT"
};
```

**Index Parameters Explained:**
- `dimension`: Must match your embedding model's output (1536 for OpenAI `text-embedding-ada-002` and `text-embedding-3-small`; 768 for many sentence-transformers models)
- `similarity`: Must match the similarity metric configured in `CouchbaseQueryDocumentStore`
  - `DOT`: Dot product (recommended for OpenAI embeddings)

#### Option 2: Composite Vector Index

Composite indexes combine vector fields with other scalar fields. This is useful when you need to filter documents by metadata before performing vector search.

**Creating a Composite Index using SQL++:**

```sql
CREATE INDEX idx_pdf_composite
ON `bucket_name`.`scope_name`.`collection_name`(embedding VECTOR)
USING GSI
WITH {
"dimension": 1536,
"similarity": "DOT"
};
```
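
Note that the statement above indexes only the vector key. To actually pre-filter on metadata before distances are computed, list scalar keys ahead of the vector key. A hypothetical sketch, where `meta.name` stands in for whatever scalar field your documents carry:

```sql
-- Hypothetical: a leading scalar key lets SQL++ filter on meta.name
-- before the vector distance computation.
CREATE INDEX idx_pdf_composite_filtered
ON `bucket_name`.`scope_name`.`collection_name`(meta.name, embedding VECTOR)
USING GSI
WITH {
  "dimension": 1536,
  "similarity": "DOT"
};
```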

#### Important Notes

1. **Index Creation Timing**: Hyperscale and Composite vector indexes require training data. Create the index **after** you've loaded the documents into your collection.

2. **Similarity Metric**: The `similarity` parameter in the index **must match** the `similarity` parameter in your `CouchbaseQueryDocumentStore` configuration.

3. **Dimension**: Must match your embedding model's output dimensions.

#### Verifying Your Index

After creating the index, verify that it exists and has finished building (its `state` should be `online`):

```sql
SELECT name, state FROM system:indexes
WHERE name = "idx_pdf_composite";
```

- #### Run the application

`streamlit run chat_with_pdf.py`

For more details on implementation, refer to the code comments in chat_with_pdf.py.
### Implementation Details

This demo uses the following key components:

1. **CouchbaseQueryDocumentStore**: GSI-based document store with vector support
- Configured with `QueryVectorSearchType.ANN` for fast approximate nearest neighbor search
- Uses `QueryVectorSearchSimilarity.DOT` for dot product similarity (recommended for OpenAI embeddings)
- Supports both BHIVe and Composite indexes

2. **CouchbaseQueryEmbeddingRetriever**: High-performance retriever for GSI vector search
- Uses SQL++ queries with the `APPROX_VECTOR_DISTANCE()` function for ANN search (see the sketch after this list)
- Retrieves the top-k most similar documents based on embedding similarity

3. **OpenAI Embeddings**:
- `text-embedding-ada-002` model with 1536 dimensions
- Generates embeddings for both documents and queries
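
To make the retriever concrete, here is a hypothetical sketch of the kind of SQL++ an ANN retrieval runs under the hood. The exact statement is generated by `couchbase-haystack`; `$query_embedding` stands in for the embedded question, and the metric and probe count mirror the document store configuration.

```sql
-- Hypothetical ANN query: APPROX_VECTOR_DISTANCE returns a distance,
-- so ascending ORDER BY surfaces the most similar chunks first.
SELECT META().id, content,
       APPROX_VECTOR_DISTANCE(embedding, $query_embedding, "DOT", 10) AS dist
FROM `bucket_name`.`scope_name`.`collection_name`
ORDER BY dist
LIMIT 10;
```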

For more details on implementation, refer to the extensive code comments in `chat_with_pdf.py`.

### Additional Resources

- [Couchbase Vector Index Documentation](https://docs.couchbase.com/cloud/vector-index/vectors-and-indexes-overview.html)
- [Haystack Documentation](https://docs.haystack.deepset.ai/)
- [couchbase-haystack GitHub Repository](https://github.com/Couchbase-Ecosystem/couchbase-haystack)
36 changes: 27 additions & 9 deletions chat_with_pdf.py
@@ -1,6 +1,7 @@
import os
import tempfile
import streamlit as st
from datetime import timedelta
from haystack import Pipeline
from haystack.components.converters import PyPDFToDocument
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
@@ -9,7 +10,18 @@
from haystack.components.builders import PromptBuilder, AnswerBuilder
from haystack.components.writers import DocumentWriter
from haystack.utils import Secret
from couchbase_haystack import CouchbaseSearchDocumentStore, CouchbaseSearchEmbeddingRetriever, CouchbasePasswordAuthenticator, CouchbaseClusterOptions
from couchbase.n1ql import QueryScanConsistency

# Import CouchbaseQueryDocumentStore for GSI-based vector search with BHIVe support
from couchbase_haystack import (
    CouchbaseQueryDocumentStore,
    CouchbaseQueryEmbeddingRetriever,
    CouchbasePasswordAuthenticator,
    CouchbaseClusterOptions,
    QueryVectorSearchType,
    QueryVectorSearchSimilarity,
    CouchbaseQueryOptions
)

def check_environment_variable(variable_name):
    """Check if environment variable is set"""
@@ -31,8 +43,8 @@ def save_to_vector_store(uploaded_file, indexing_pipeline):

@st.cache_resource(show_spinner="Connecting to Vector Store")
def get_document_store():
    """Return the Couchbase document store"""
    return CouchbaseSearchDocumentStore(
    """Return the Couchbase document store using CouchbaseQueryDocumentStore."""
    return CouchbaseQueryDocumentStore(
        cluster_connection_string=Secret.from_env_var("DB_CONN_STR"),
        authenticator=CouchbasePasswordAuthenticator(
            username=Secret.from_env_var("DB_USERNAME"),
@@ -42,7 +54,13 @@ def get_document_store():
        bucket=os.getenv("DB_BUCKET"),
        scope=os.getenv("DB_SCOPE"),
        collection=os.getenv("DB_COLLECTION"),
        vector_search_index=os.getenv("INDEX_NAME"),
        search_type=QueryVectorSearchType.ANN,        # approximate nearest-neighbour search
        similarity=QueryVectorSearchSimilarity.DOT,   # must match the vector index's "DOT" metric
        nprobes=10,                                   # more probes = better recall, higher latency
        query_options=CouchbaseQueryOptions(
            timeout=timedelta(seconds=60),
            scan_consistency=QueryScanConsistency.NOT_BOUNDED
        )
    )


@@ -57,7 +75,7 @@ )
)

# Load and check environment variables
env_vars = ["DB_CONN_STR", "DB_USERNAME", "DB_PASSWORD", "DB_BUCKET", "DB_SCOPE", "DB_COLLECTION", "INDEX_NAME", "OPENAI_API_KEY"]
env_vars = ["DB_CONN_STR", "DB_USERNAME", "DB_PASSWORD", "DB_BUCKET", "DB_SCOPE", "DB_COLLECTION", "OPENAI_API_KEY"]
for var in env_vars:
    check_environment_variable(var)

@@ -80,7 +98,7 @@ def get_document_store():
# Create RAG pipeline
rag_pipeline = Pipeline()
rag_pipeline.add_component("query_embedder", OpenAITextEmbedder())
rag_pipeline.add_component("retriever", CouchbaseSearchEmbeddingRetriever(document_store=document_store))
rag_pipeline.add_component("retriever", CouchbaseQueryEmbeddingRetriever(document_store=document_store))
rag_pipeline.add_component("prompt_builder", PromptBuilder(template="""
You are a helpful bot. If you cannot answer based on the context provided, respond with a generic answer. Answer the question as truthfully as possible using the context below:
{% for doc in documents %}
@@ -109,7 +127,7 @@ def get_document_store():
couchbase_logo = "https://emoji.slack-edge.com/T024FJS4M/couchbase/4a361e948b15ed91.png"

st.title("Chat with PDF")
st.markdown("Answers with [Couchbase logo](https://emoji.slack-edge.com/T024FJS4M/couchbase/4a361e948b15ed91.png) are generated using *RAG* while 🤖 are generated by pure *LLM (Gemini)*")
st.markdown("Answers with [Couchbase logo](https://emoji.slack-edge.com/T024FJS4M/couchbase/4a361e948b15ed91.png) are generated using *RAG* while 🤖 are generated by pure *LLM (OpenAI)*")

with st.sidebar:
    st.header("Upload your PDF")
@@ -123,10 +141,10 @@ st.markdown("""
    st.markdown("""
For each question, you will get two answers:
* one using RAG ([Couchbase logo](https://emoji.slack-edge.com/T024FJS4M/couchbase/4a361e948b15ed91.png))
* one using pure LLM - Gemini (🤖).
* one using pure LLM - OpenAI (🤖).
""")

    st.markdown("For RAG, we are using [Haystack](https://haystack.deepset.ai/), [Couchbase Vector Search](https://couchbase.com/) & [Gemini](https://gemini.google.com/). We fetch parts of the PDF relevant to the question using Vector search & add it as the context to the LLM. The LLM is instructed to answer based on the context from the Vector Store.")
    st.markdown("For RAG, we are using [Haystack](https://haystack.deepset.ai/), [Couchbase Vector Search](https://docs.couchbase.com/cloud/vector-index/hyperscale-vector-index.html) & [OpenAI](https://openai.com/). We fetch parts of the PDF relevant to the question using high-performance GSI vector search & add it as the context to the LLM. The LLM is instructed to answer based on the context from the Vector Store.")

if "messages" not in st.session_state:
st.session_state.messages = []
2 changes: 1 addition & 1 deletion requirements.txt
@@ -1,4 +1,4 @@
streamlit==1.42.2
couchbase-haystack==2.0.0
couchbase-haystack==2.1.0
pypdf==5.3.0
nltk>=3.9.1