1 change: 1 addition & 0 deletions .gitignore
@@ -0,0 +1 @@
.streamlit/secrets.toml
231 changes: 103 additions & 128 deletions README.md
@@ -1,16 +1,24 @@
## RAG Demo using Couchbase, Streamlit, Haystack, and OpenAI

This is a demo app built to chat with your custom PDFs using the vector search capabilities of Couchbase to augment the OpenAI results in a Retrieval-Augmented-Generation (RAG) model.
This is a demo app built to chat with your custom PDFs using the **Couchbase Vector Index** to augment the OpenAI results in a Retrieval-Augmented-Generation (RAG) model.

This demo uses **`CouchbaseQueryDocumentStore`** with GSI vector indexes, which offers:

- **High-performance vector search at massive scale** (billions of documents)
- **BHIVe (Hyperscale Vector Index)** support for pure vector search
- **Composite Vector Index** support for filtered vector search
- **SQL++ queries** for efficient vector retrieval
- **Low memory footprint** and concurrent updates & searches

### How does it work?

You can upload your PDFs with custom data and ask questions about that data in the chat box.
For each question, you will get two answers:

- one using RAG (Couchbase logo)
- one using pure LLM - Gemini Pro (🤖).
- one using pure LLM - OpenAI (🤖).

The RAG pipeline utilizes Haystack, Couchbase Vector Search, and a OpenAI model. It fetches relevant parts of the PDF using vector search and adds them as context for the language model.
The RAG pipeline utilizes Haystack, the Couchbase GSI Vector Index (Hyperscale or Composite), and OpenAI models. It fetches relevant parts of the PDF using vector search and adds them as context for the language model.
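
Before any question can be answered, the uploaded PDF has to be converted, chunked, embedded, and written to Couchbase. Below is a minimal sketch of that indexing pipeline, assuming standard Haystack 2.x wiring: the converter, cleaner, splitter, and writer match the imports in `chat_with_pdf.py`, while `OpenAIDocumentEmbedder` and the `document_store` variable are assumptions for illustration.

```python
from haystack import Pipeline
from haystack.components.converters import PyPDFToDocument
from haystack.components.embedders import OpenAIDocumentEmbedder
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter

# Assumes document_store is the CouchbaseQueryDocumentStore configured
# in chat_with_pdf.py (see "Implementation Details" below).
indexing_pipeline = Pipeline()
indexing_pipeline.add_component("converter", PyPDFToDocument())        # PDF -> Documents
indexing_pipeline.add_component("cleaner", DocumentCleaner())          # strip noise and extra whitespace
indexing_pipeline.add_component("splitter", DocumentSplitter())        # chunk documents for retrieval
indexing_pipeline.add_component("embedder", OpenAIDocumentEmbedder())  # 1536-dim OpenAI embeddings
indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store))

indexing_pipeline.connect("converter", "cleaner")
indexing_pipeline.connect("cleaner", "splitter")
indexing_pipeline.connect("splitter", "embedder")
indexing_pipeline.connect("embedder", "writer")

indexing_pipeline.run({"converter": {"sources": ["your_file.pdf"]}})
```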


### Setup and Installation
@@ -30,147 +38,114 @@ The RAG pipeline utilizes Haystack, Couchbase Vector Search, and a OpenAI model.
DB_BUCKET = "<bucket_name>"
DB_SCOPE = "<scope_name>"
DB_COLLECTION = "<collection_name>"
INDEX_NAME = "<vector_capable_fts_index_name>"
OPENAI_API_KEY = "<openai_api_key>"
```

- #### Create the Search Index on Full Text Service
- #### Create the Vector Index

We need to create the Search Index on the Full Text Service in Couchbase. For this demo, you can import the following index using the instructions.
This demo uses Couchbase's new Vector Indexes (introduced in version 8.0), which give you two options:

- [Couchbase Capella](https://docs.couchbase.com/cloud/search/import-search-index.html)
**Option 1: Hyperscale Vector Index** - Recommended in general for its scalability; this is the index used in this demo

BHIVe is optimized for pure vector search at scale. It's perfect for chatbots, RAG applications, and scenarios where you need fast vector similarity search on large datasets.

- Copy the index definition to a new file index.json
- Import the file in Capella using the instructions in the documentation.
- Click on Create Index to create the index.
**Option 2: Composite Vector Index**

Composite indexes combine vector fields with other scalar fields, allowing you to apply filters before vector search. This is useful when you need to narrow down results based on metadata (e.g., date, category, user_id) before performing vector similarity search.

- [Couchbase Server](https://docs.couchbase.com/server/current/search/import-search-index.html)

- Click on Search -> Add Index -> Import
- Copy the following Index definition in the Import screen
- Click on Create Index to create the index.
Learn more about these two vector indexes and when to use one over the other [here](https://docs.couchbase.com/cloud/vector-index/use-vector-indexes.html).

- #### Key Components

- Streamlit: Provides the web interface
- Haystack: Orchestrates the RAG pipeline
- Couchbase: Serves as the vector store
- OpenAI: Supplies the language model

#### Index Definition

Here, we are creating the index `pdf_search` on the documents in the `haystack_collection` collection within the `haystack_scope` scope in the bucket `haystack_bucket`. The Vector field is set to `embeddings` with 1536 dimensions and the text field set to `text`. We are also indexing and storing all the fields under `metadata` in the document as a dynamic mapping to account for varying document structures. The similarity metric is set to `dot_product`. If there is a change in these parameters, please adapt the index accordingly.

```json
{
  "name": "pdf_search",
  "type": "fulltext-index",
  "sourceType": "gocbcore",
  "sourceName": "haystack_bucket",
  "planParams": {
    "indexPartitions": 1,
    "numReplicas": 0
  },
  "params": {
    "doc_config": {
      "docid_prefix_delim": "",
      "docid_regexp": "",
      "mode": "scope.collection.type_field",
      "type_field": "type"
    },
    "mapping": {
      "default_analyzer": "standard",
      "default_datetime_parser": "dateTimeOptional",
      "index_dynamic": true,
      "store_dynamic": true,
      "default_mapping": {
        "dynamic": true,
        "enabled": false
      },
      "types": {
        "haystack_scope.haystack_collection": {
          "dynamic": false,
          "enabled": true,
          "properties": {
            "content": {
              "enabled": true,
              "fields": [
                {
                  "docvalues": true,
                  "include_in_all": false,
                  "include_term_vectors": false,
                  "index": true,
                  "name": "content",
                  "store": true,
                  "type": "text"
                }
              ]
            },
            "embedding": {
              "enabled": true,
              "dynamic": false,
              "fields": [
                {
                  "vector_index_optimized_for": "recall",
                  "docvalues": true,
                  "dims": 1536,
                  "include_in_all": false,
                  "include_term_vectors": false,
                  "index": true,
                  "name": "embedding",
                  "similarity": "dot_product",
                  "store": true,
                  "type": "vector"
                }
              ]
            },
            "dataframe": {
              "enabled": true,
              "fields": [
                {
                  "docvalues": true,
                  "include_in_all": false,
                  "include_term_vectors": false,
                  "index": true,
                  "name": "dataframe",
                  "store": true,
                  "analyzer": "keyword",
                  "type": "text"
                }
              ]
            },
            "meta": {
              "dynamic": true,
              "enabled": true,
              "properties": {
                "name": {
                  "enabled": true,
                  "fields": [
                    {
                      "docvalues": true,
                      "include_in_all": false,
                      "include_term_vectors": false,
                      "index": true,
                      "name": "name",
                      "store": true,
                      "analyzer": "keyword",
                      "type": "text"
                    }
                  ]
                }
              }
            }
          }
        }
      }
    }
  }
}
```
- Couchbase: Serves as the high-performance vector store
- OpenAI: Supplies embeddings and the language model

### Vector Index Creation

You need to create a vector index on your collection **after** loading some documents (the index needs data for training). Choose between a Hyperscale (BHIVe) or a Composite index based on your use case. Whichever vector index you choose won't affect the functionality of this demo, though performance may differ.

#### Option 1: Hyperscale Vector Index - Recommended

Hyperscale is a dedicated vector index optimized for pure vector search at massive scale. Use it for the best performance in RAG applications. Refer [here](https://docs.couchbase.com/cloud/vector-index/hyperscale-vector-index.html) for detailed instructions.

**Creating a Hyperscale Index using SQL++:**

You can create the index using the Couchbase Query Workbench or programmatically:

```sql
CREATE VECTOR INDEX idx_pdf_hyperscale
ON `bucket_name`.`scope_name`.`collection_name`(embedding VECTOR)
WITH {
"dimension": 1536,
"similarity": "DOT"
};
```

**Index Parameters Explained:**
- `dimension`: Must match your embedding model's output (1536 for OpenAI `text-embedding-ada-002` and `text-embedding-3-small`; 768 for many sentence-transformers models)
- `similarity`: Must match the similarity metric configured in `CouchbaseQueryDocumentStore`
  - `DOT`: Dot product (recommended for OpenAI embeddings)

#### Option 2: Composite Vector Index

Composite indexes combine vector fields with other scalar fields. This is useful when you need to filter documents by metadata before performing vector search.

**Creating a Composite Index using SQL++:**

```sql
CREATE INDEX idx_pdf_composite
ON `bucket_name`.`scope_name`.`collection_name`(embedding VECTOR)
USING GSI
WITH {
"dimension": 1536,
"similarity": "DOT"
};
```
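
Note that the statement above indexes only the vector key. To actually pre-filter on metadata before distances are computed, list scalar keys ahead of the vector key. A hypothetical sketch, where `meta.name` stands in for whatever scalar field your documents carry:

```sql
-- Hypothetical: a leading scalar key lets SQL++ filter on meta.name
-- before the vector distance computation.
CREATE INDEX idx_pdf_composite_filtered
ON `bucket_name`.`scope_name`.`collection_name`(meta.name, embedding VECTOR)
USING GSI
WITH {
  "dimension": 1536,
  "similarity": "DOT"
};
```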

#### Important Notes

1. **Index Creation Timing**: Hyperscale and Composite vector indexes require training data. Create the index **after** you've loaded the documents into your collection.

2. **Similarity Metric**: The `similarity` parameter in the index **must match** the `similarity` parameter in your `CouchbaseQueryDocumentStore` configuration.

3. **Dimension**: Must match your embedding model's output dimensions.

#### Verifying Your Index

After creating the index, verify that it exists and has finished building (its `state` should be `online`):

```sql
SELECT name, state FROM system:indexes
WHERE name = "idx_pdf_composite";
```

- #### Run the application

`streamlit run chat_with_pdf.py`

For more details on implementation, refer to the code comments in chat_with_pdf.py.
### Implementation Details

This demo uses the following key components:

1. **CouchbaseQueryDocumentStore**: GSI-based document store with vector support
- Configured with `QueryVectorSearchType.ANN` for fast approximate nearest neighbor search
- Uses `QueryVectorSearchSimilarity.DOT` for dot product similarity (recommended for OpenAI embeddings)
- Supports both BHIVe and Composite indexes

2. **CouchbaseQueryEmbeddingRetriever**: High-performance retriever for GSI vector search
- Uses SQL++ queries with the `APPROX_VECTOR_DISTANCE()` function for ANN search (see the sketch after this list)
- Retrieves the top-k most similar documents based on embedding similarity

3. **OpenAI Embeddings**:
- `text-embedding-ada-002` model with 1536 dimensions
- Generates embeddings for both documents and queries
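
To make the retriever concrete, here is a hypothetical sketch of the kind of SQL++ an ANN retrieval runs under the hood. The exact statement is generated by `couchbase-haystack`; `$query_embedding` stands in for the embedded question, and the metric and probe count mirror the document store configuration.

```sql
-- Hypothetical ANN query: APPROX_VECTOR_DISTANCE returns a distance,
-- so ascending ORDER BY surfaces the most similar chunks first.
SELECT META().id, content,
       APPROX_VECTOR_DISTANCE(embedding, $query_embedding, "DOT", 10) AS dist
FROM `bucket_name`.`scope_name`.`collection_name`
ORDER BY dist
LIMIT 10;
```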

For more details on implementation, refer to the extensive code comments in `chat_with_pdf.py`.

### Additional Resources

- [Couchbase Vector Index Documentation](https://docs.couchbase.com/cloud/vector-index/vectors-and-indexes-overview.html)
- [Haystack Documentation](https://docs.haystack.deepset.ai/)
- [couchbase-haystack GitHub Repository](https://github.com/Couchbase-Ecosystem/couchbase-haystack)
36 changes: 27 additions & 9 deletions chat_with_pdf.py
@@ -1,6 +1,7 @@
import os
import tempfile
import streamlit as st
from datetime import timedelta
from haystack import Pipeline
from haystack.components.converters import PyPDFToDocument
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
@@ -9,7 +10,18 @@
from haystack.components.builders import PromptBuilder, AnswerBuilder
from haystack.components.writers import DocumentWriter
from haystack.utils import Secret
from couchbase_haystack import CouchbaseSearchDocumentStore, CouchbaseSearchEmbeddingRetriever, CouchbasePasswordAuthenticator, CouchbaseClusterOptions
from couchbase.n1ql import QueryScanConsistency

# Import CouchbaseQueryDocumentStore for GSI-based vector search with BHIVe support
from couchbase_haystack import (
    CouchbaseQueryDocumentStore,
    CouchbaseQueryEmbeddingRetriever,
    CouchbasePasswordAuthenticator,
    CouchbaseClusterOptions,
    QueryVectorSearchType,
    QueryVectorSearchSimilarity,
    CouchbaseQueryOptions
)

def check_environment_variable(variable_name):
    """Check if environment variable is set"""
@@ -31,8 +43,8 @@ def save_to_vector_store(uploaded_file, indexing_pipeline):

@st.cache_resource(show_spinner="Connecting to Vector Store")
def get_document_store():
    """Return the Couchbase document store"""
    return CouchbaseSearchDocumentStore(
    """Return the Couchbase document store using CouchbaseQueryDocumentStore."""
    return CouchbaseQueryDocumentStore(
        cluster_connection_string=Secret.from_env_var("DB_CONN_STR"),
        authenticator=CouchbasePasswordAuthenticator(
            username=Secret.from_env_var("DB_USERNAME"),
@@ -42,7 +54,13 @@ def get_document_store():
        bucket=os.getenv("DB_BUCKET"),
        scope=os.getenv("DB_SCOPE"),
        collection=os.getenv("DB_COLLECTION"),
        vector_search_index=os.getenv("INDEX_NAME"),
        search_type=QueryVectorSearchType.ANN,        # approximate nearest-neighbour search
        similarity=QueryVectorSearchSimilarity.DOT,   # must match the vector index's "DOT" metric
        nprobes=10,                                   # more probes = better recall, higher latency
        query_options=CouchbaseQueryOptions(
            timeout=timedelta(seconds=60),
            scan_consistency=QueryScanConsistency.NOT_BOUNDED
        )
    )


@@ -57,7 +75,7 @@ )
)

# Load and check environment variables
env_vars = ["DB_CONN_STR", "DB_USERNAME", "DB_PASSWORD", "DB_BUCKET", "DB_SCOPE", "DB_COLLECTION", "INDEX_NAME", "OPENAI_API_KEY"]
env_vars = ["DB_CONN_STR", "DB_USERNAME", "DB_PASSWORD", "DB_BUCKET", "DB_SCOPE", "DB_COLLECTION", "OPENAI_API_KEY"]
for var in env_vars:
    check_environment_variable(var)

@@ -80,7 +98,7 @@ def get_document_store():
# Create RAG pipeline
rag_pipeline = Pipeline()
rag_pipeline.add_component("query_embedder", OpenAITextEmbedder())
rag_pipeline.add_component("retriever", CouchbaseSearchEmbeddingRetriever(document_store=document_store))
rag_pipeline.add_component("retriever", CouchbaseQueryEmbeddingRetriever(document_store=document_store))
rag_pipeline.add_component("prompt_builder", PromptBuilder(template="""
You are a helpful bot. If you cannot answer based on the context provided, respond with a generic answer. Answer the question as truthfully as possible using the context below:
{% for doc in documents %}
@@ -109,7 +127,7 @@ def get_document_store():
couchbase_logo = "https://emoji.slack-edge.com/T024FJS4M/couchbase/4a361e948b15ed91.png"

st.title("Chat with PDF")
st.markdown("Answers with [Couchbase logo](https://emoji.slack-edge.com/T024FJS4M/couchbase/4a361e948b15ed91.png) are generated using *RAG* while 🤖 are generated by pure *LLM (Gemini)*")
st.markdown("Answers with [Couchbase logo](https://emoji.slack-edge.com/T024FJS4M/couchbase/4a361e948b15ed91.png) are generated using *RAG* while 🤖 are generated by pure *LLM (OpenAI)*")

with st.sidebar:
    st.header("Upload your PDF")
@@ -123,10 +141,10 @@ st.markdown("""
    st.markdown("""
For each question, you will get two answers:
* one using RAG ([Couchbase logo](https://emoji.slack-edge.com/T024FJS4M/couchbase/4a361e948b15ed91.png))
* one using pure LLM - Gemini (🤖).
* one using pure LLM - OpenAI (🤖).
""")

    st.markdown("For RAG, we are using [Haystack](https://haystack.deepset.ai/), [Couchbase Vector Search](https://couchbase.com/) & [Gemini](https://gemini.google.com/). We fetch parts of the PDF relevant to the question using Vector search & add it as the context to the LLM. The LLM is instructed to answer based on the context from the Vector Store.")
    st.markdown("For RAG, we are using [Haystack](https://haystack.deepset.ai/), [Couchbase Vector Search](https://docs.couchbase.com/cloud/vector-index/hyperscale-vector-index.html) & [OpenAI](https://openai.com/). We fetch parts of the PDF relevant to the question using high-performance GSI vector search & add it as the context to the LLM. The LLM is instructed to answer based on the context from the Vector Store.")

if "messages" not in st.session_state:
st.session_state.messages = []
2 changes: 1 addition & 1 deletion requirements.txt
@@ -1,4 +1,4 @@
streamlit==1.42.2
couchbase-haystack==2.0.0
couchbase-haystack==2.1.0
pypdf==5.3.0
nltk>=3.9.1