Commit 9f00a6a

Standardized the usage of the Auto-Vectorization term
Linked vectorstore construction to the LangChain Couchbase docs; removed the Couchbase SDK as a dependency
1 parent eba9531 commit 9f00a6a

File tree

1 file changed: +23 -19 lines


autovec_unstructured/autovec_unstructured.ipynb

Lines changed: 23 additions & 19 deletions
@@ -8,7 +8,7 @@
 },
 "source": [
 "# Auto-Vectorization on Unstructured Data Stored in S3 Buckets Using Couchbase Capella AI Services \n",
-"This comprehensive tutorial demonstrates how to use Couchbase Capella's new AI Services auto-vectorization feature to automatically convert your unstructured data stored in S3 buckets to import it in Capella and convert it into vector embeddings and perform semantic search using LangChain."
+"This comprehensive tutorial demonstrates how to use Couchbase Capella's new AI Services Auto-Vectorization feature to automatically import your unstructured data stored in S3 buckets into Capella, convert it into vector embeddings, and perform semantic search using LangChain."
 ]
 },
 {
@@ -78,7 +78,7 @@
 " \n",
 " <img src=\"./img/workflow.png\" width=\"1000px\" height=\"500px\" style=\"padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;\">\n",
 " \n",
-"2) Start your workflow deployment by giving it a name and selecting where your data will be provided to the auto-vectorization service. There are currently three options: <B>`pre-processed data (JSON format) from Capella`</B>, <B>`pre-processed data (JSON format) from external sources (S3 buckets)`</B> and <B>`unstructured data from external sources (S3 buckets)`</B>. For this tutorial, we will choose the third option, which is unstructured data from external sources (S3 buckets). After selecting the workflow enter the workflow name and click on <B>`Start Workflow`</B>.\n",
+"2) Start your workflow deployment by giving it a name and selecting where your data will be provided to the Auto-Vectorization service. There are currently three options: <B>`pre-processed data (JSON format) from Capella`</B>, <B>`pre-processed data (JSON format) from external sources (S3 buckets)`</B> and <B>`unstructured data from external sources (S3 buckets)`</B>. For this tutorial, we will choose the third option, which is unstructured data from external sources (S3 buckets). After selecting the workflow, enter the workflow name and click on <B>`Start Workflow`</B>.\n",
 " \n",
 " <img src=\"./img/start_workflow.png\" width=\"1000px\" height=\"500px\" style=\"padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;\">\n",
 "\n",
@@ -103,7 +103,9 @@
 "6) On selection of the S3 bucket, various options will be displayed as described below.\n",
 "\n",
 " <img src=\"./img/configure_data_source.png\" width=\"900px\" height=\"500px\" style=\"padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;\">\n",
-"- `Index Configuration` allows us to create a search index on the generated embeddings of the imported data. If it's skipped then the functionality of vector searching will not be enabled and you need to create index later on.\n",
+"- <B>`Index Configuration`</B> allows the workflow to **automatically create a Search index** on the generated embeddings. This Search index is essential for performing vector similarity searches. \n",
+" - If you enable this option (recommended), the workflow will create a properly configured Search index that includes vector field mappings for your embeddings.\n",
+" - If you skip this step, you'll need to manually create a Search index later before you can perform vector searches. See the [Search Index Creation Guide](https://docs.couchbase.com/server/current/search/create-search-indexes.html) below for manual setup instructions.\n",
 "- `Destination Cluster` helps choose the cluster, bucket, scope and collection in which the data needs to be imported.\n",
 "- `Estimated Cost` dialogue box in blue color (on the right) will show you the cost of operation per document.\n",
 "- Click on `Next`.\n",
@@ -132,7 +134,7 @@
 "\n",
 " <img src=\"./img/workflow_summary.png\" width=\"800px\" height=\"500px\" style=\"padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;\">\n",
 "\n",
-"10) <B>`Hurray! Workflow Deployed`</B> Now in the `workflow` tab we can see our workflow deployed and can check the status of our workflow. The status of the workflow run will be shown over here.\n",
+"10) <B>`Workflow Deployed`</B> Now in the `workflow` tab we can see our workflow deployed and can check the status of our workflow. The status of the workflow run will be shown over here.\n",
 "\n",
 " <img src=\"./img/workflow_deployed.png\" width=\"950px\" height=\"350px\" style=\"padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;\">\n",
 "\n",
@@ -145,11 +147,11 @@
 "id": "4f7321a7",
 "metadata": {},
 "source": [
-"# 4. Vector Search\n",
+"# 4. Vector Search Using Couchbase Search Service\n",
 "\n",
-"The following code cells implement semantic vector search against the embeddings generated by the AutoVectorization workflow. \n",
+"The following code cells implement semantic vector search against the embeddings generated by the Auto-Vectorization workflow. These searches are powered by **Couchbase's Search service**.\n",
 "\n",
-"Before you proceed, make sure the following packages are installed by running: "
+"Before you proceed, make sure the following packages are installed by running:"
 ]
 },
 {
@@ -163,16 +165,15 @@
 },
 "outputs": [],
 "source": [
-"!pip install couchbase langchain-couchbase langchain-openai"
+"!pip install langchain-couchbase langchain-openai"
 ]
 },
 {
 "cell_type": "markdown",
 "id": "ea920e0f-bd81-4a74-841a-86a11cb8aec4",
 "metadata": {},
 "source": [
-"`couchbase - Version: 4.4.0` \\\n",
-"`langchain-couchbase - Version: 0.4.0` \\\n",
+"`langchain-couchbase - Version: 0.5.0` \\\n",
 "`langchain-openai - Version: 0.3.34` \n",
 "\n",
 "Now, please proceed to execute the cells in order to run the vector similarity search.\n",
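The hunk above pins specific package versions. As a small stdlib-only sketch (not part of the notebook itself), you could verify what is actually installed before running the cells; the package names match the diff, everything else here is illustrative:

```python
# Check installed package versions with the standard library only.
# Illustrative sketch; the notebook simply assumes the pinned versions.
from importlib.metadata import version, PackageNotFoundError


def installed_version(package: str):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    for pkg in ("langchain-couchbase", "langchain-openai"):
        print(f"{pkg}: {installed_version(pkg) or 'not installed'}")
```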
@@ -229,7 +230,7 @@
 "source": [
 "# Selection of Buckets / Scope / Collection / Index / Embedder\n",
 " - Sets the bucket, scope, and collection where the documents (with vector fields) live.\n",
-" - `index_name` specifies the Capella Search index name.\n",
+" - `index_name` specifies the **Capella Search index name**. This is the Search index created automatically during the workflow setup (step 3.6) or manually as described in the Search Index Creation section. You can find this index name in the **Search** tab of your Capella cluster.\n",
 " - `embedder` instantiates the NVIDIA embedding model that will transform the user's natural language query into a vector at search time.\n",
 " - `open_api_key` is the api key token created in `step 2.3`.\n",
 " - `open_api_base` is the Capella model services endpoint found in the models section.\n",
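The `embedder` role described in this hunk can be illustrated with a stand-in class: anything exposing `embed_query` (as LangChain embedding classes do) can turn a query string into a vector at search time. The hashing scheme below is a toy for illustration only; the tutorial actually uses `OpenAIEmbeddings` against the Capella model services endpoint:

```python
# Toy stand-in for an embedding model: exposes embed_query() like LangChain
# embedding classes, but produces a deterministic hash-based pseudo-vector.
# Purely illustrative; it does not approximate the NVIDIA model's output.
import hashlib


class ToyEmbedder:
    def __init__(self, dim: int = 8):
        self.dim = dim

    def embed_query(self, text: str) -> list:
        """Map a query string to a fixed-length pseudo-embedding."""
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return [b / 255.0 for b in digest[: self.dim]]


vec = ToyEmbedder().embed_query("How to setup java SDK?")
print(len(vec))  # dimensionality of the toy embedding
```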
@@ -247,7 +248,7 @@
 "bucket_name = \"Unstructured_data_bucket\"\n",
 "scope_name = \"_default\"\n",
 "collection_name = \"_default\"\n",
-"index_name = \"hyperscale_autovec_workflow_text-embedding\" # This is the name of the search index that was created in step 3.6 and can also be seen in the search tab of the cluster.\n",
+"index_name = \"hyperscale_autovec_workflow_text-to-embed\" # This is the name of the search index that was created in step 3.6 and can also be seen in the search tab of the cluster.\n",
 " \n",
 "# Using the OpenAI SDK for the embeddings with the capella model services and they are compatible with the OpenAIEmbeddings class in Langchain\n",
 "embedder = OpenAIEmbeddings(\n",
@@ -265,11 +266,13 @@
 "metadata": {},
 "source": [
 "# VectorStore Construction\n",
-" - Creates a `CouchbaseSearchVectorStore` instance that:\n",
+" - Creates a [CouchbaseSearchVectorStore](https://couchbase-ecosystem.github.io/langchain-couchbase/langchain_couchbase.html#couchbase-search-vector-store) instance that interfaces with **Couchbase's Search service** to perform vector similarity searches.\n",
+" - The vector store:\n",
 " * Knows where to read documents (`bucket/scope/collection`).\n",
-" * Knows the embedding field (the vector produced by the AutoVectorization workflow).\n",
-" * Uses the provided embedder to embed queries on-demand.\n",
-" - If your AutoVectorization workflow produced a different vector field name, update `embedding_key` accordingly.\n",
+" * References the Search index (`index_name`) that contains vector field mappings.\n",
+" * Knows the embedding field (the vector produced by the Auto-Vectorization workflow).\n",
+" * Uses the provided embedder to embed queries on-demand for similarity search.\n",
+" - If your Auto-Vectorization workflow produced a different vector field name, update `embedding_key` accordingly.\n",
 " - If you mapped multiple fields into a single vector, you can choose any representative field for `text_key`, or modify the VectorStore wrapper to concatenate fields."
 ]
 },
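The last bullet in the hunk above suggests concatenating several source fields when more than one field was mapped into a single vector. A minimal, library-free sketch of such a helper (the field names `title` and `body` are hypothetical, chosen only for illustration):

```python
# Hypothetical helper illustrating the "concatenate fields" note: join several
# document fields into one display string suitable for use as `text_key`
# content. Field names here are made up; use your own collection's fields.
def concat_text_fields(doc: dict, fields: list, sep: str = "\n") -> str:
    """Concatenate the given fields of a document, skipping missing ones."""
    return sep.join(str(doc[f]) for f in fields if f in doc)


doc = {
    "title": "Java SDK setup",
    "body": "Install the SDK via Maven.",
    "text-embedding": [0.1, 0.2],  # vector field produced by the workflow
}
print(concat_text_fields(doc, ["title", "body"]))
```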
@@ -299,7 +302,8 @@
 "source": [
 "# Performing a Similarity Search\n",
 " - Defines a natural language query (e.g., \"How to setup java SDK?\").\n",
-" - Calls `similarity_search(k=3)` to retrieve the top 3 most semantically similar documents.\n",
+" - Calls `similarity_search_with_score(k=3)` to retrieve the top 3 most semantically similar documents using **Couchbase's Search service**.\n",
+" - The Search service performs efficient vector similarity search using the index created earlier.\n",
 " - Prints ranked results, extracting the chosen `text_key` (here `text-to-embed`).\n",
 " - Change `query` to any descriptive phrase (e.g., \"beach resort\", \"airport hotel near NYC\").\n",
 " - Adjust `k` for more or fewer results."
@@ -346,13 +350,13 @@
 "- Leading number (1, 2, 3): The result rank (1 = most similar to your query).\n",
 "- Content text: This is the value of the field you configured as `text_key` (in this tutorial: `text-to-embed`). It represents the human-readable content we chose to display.\n",
 "\n",
-"### How the Ranking Works\n",
+"### How the Ranking Works with Search Service\n",
 "1. Your natural language query (e.g., `query = \"How to setup java SDK?\"`) is embedded using the NVIDIA model (`nvidia/llama-3.2-nv-embedqa-1b-v2`).\n",
 "2. The vector store compares the query embedding to stored document embeddings in the field you configured (`embedding_key = \"text-embedding\"`).\n",
 "3. Results are sorted by vector similarity. Higher similarity = closer semantic meaning.\n",
 "\n",
 "\n",
-"> Your vector search pipeline is working if the returned documents feel meaningfully related to your natural language query—even when exact keywords do not match. Feel free to experiment with increasingly descriptive queries to observe the semantic power of the embeddings."
+"> Your vector search pipeline is working if the returned documents feel meaningfully related to your natural language query—even when exact keywords do not match. Feel free to experiment with increasingly descriptive queries to observe the semantic power of the embeddings powered by Couchbase's Search service."
 ]
 }
 ],
