11 changes: 11 additions & 0 deletions api-reference/workflow/destinations/ibm-watsonxdata.mdx
@@ -2,6 +2,17 @@
title: IBM watsonx.data
---

<Tip>
The IBM watsonx.data destination connector relies on an Apache Iceberg-based catalog within the watsonx.data data store instance.
Apache Iceberg is suitable for managed data storage and cataloging, but not for embedding storage or semantic similarity
queries. For embedding storage and semantic similarity queries, Unstructured recommends that you use the following destination connectors
instead:

- [Astra DB](/api-reference/workflow/destinations/astradb)
- [Milvus](/api-reference/workflow/destinations/milvus) on IBM watsonx.data

</Tip>

import FirstTimeAPIDestinationConnector from '/snippets/general-shared-text/first-time-api-destination-connector.mdx';

<FirstTimeAPIDestinationConnector />
11 changes: 11 additions & 0 deletions open-source/ingestion/destination-connectors/ibm-watsonxdata.mdx
@@ -2,6 +2,17 @@
title: IBM watsonx.data
---

<Tip>
The IBM watsonx.data destination connector relies on an Apache Iceberg-based catalog within the watsonx.data data store instance.
Apache Iceberg is suitable for managed data storage and cataloging, but not for embedding storage or semantic similarity
queries. For embedding storage and semantic similarity queries, Unstructured recommends that you use the following destination connectors
instead:

- [Astra DB](/open-source/ingestion/destination-connectors/astradb)
- [Milvus](/open-source/ingestion/destination-connectors/milvus) on IBM watsonx.data

</Tip>

import SharedIBMWatsonxdata from '/snippets/dc-shared-text/ibm-watsonxdata-cli-api.mdx';

<SharedIBMWatsonxdata />
4 changes: 2 additions & 2 deletions snippets/general-shared-text/astradb-api-placeholders.mdx
@@ -1,7 +1,7 @@
- `<name>` (_required_) - A unique name for this connector.
- `<token>` (_required_) - The application token for the database.
- `<api-endpoint>` (_required_) - The databases associated API endpoint.
- `<collection-name>` - The name of the collection in the namespace. If no value is provided, see the beginning of this article for the behavior at run time.
- `<api-endpoint>` (_required_) - The database's associated API endpoint.
- `<collection-name>` - The name of the collection in the keyspace. If no value is provided, see the beginning of this article for the behavior at run time.
- `<keyspace>` - The name of the keyspace in the database. The default is `default_keyspace` if not otherwise specified.
- `<batch-size>` - The maximum number of records to send per batch. The default is `20` if not otherwise specified.
- `flatten_metadata` - Set to `true` to flatten the metadata into each record. Specifically, when flattened, the metadata key values are brought to the top level of the element, and the `metadata` key itself is removed. By default, the metadata is not flattened (`false`).
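
The placeholders above, with their stated defaults, can be summarized in a minimal Python sketch. The class name and attribute names here are hypothetical, chosen only to mirror the placeholders; they are not confirmed field names of the Unstructured API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AstraDBDestinationSettings:
    """Hypothetical container mirroring the placeholders (not an Unstructured type)."""

    name: str                               # <name> (required)
    token: str                              # <token> (required)
    api_endpoint: str                       # <api-endpoint> (required)
    collection_name: Optional[str] = None   # <collection-name>; None defers to run-time behavior
    keyspace: str = "default_keyspace"      # <keyspace> default stated above
    batch_size: int = 20                    # <batch-size> default stated above
    flatten_metadata: bool = False          # metadata is not flattened by default

    def __post_init__(self) -> None:
        # Reject empty values for the three required placeholders.
        for field_name in ("name", "token", "api_endpoint"):
            if not getattr(self, field_name):
                raise ValueError(f"{field_name} is required")
```

Constructing the object with only the three required values leaves every optional setting at the default described in the list.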
2 changes: 1 addition & 1 deletion snippets/general-shared-text/astradb-platform.mdx
@@ -1,7 +1,7 @@
Fill in the following fields:

- **Name** (_required_): A unique name for this connector.
- **Collection Name**: The name of the collection in the namespace. If no value is provided, see the beginning of this article for the behavior at run time.
- **Collection Name**: The name of the collection in the keyspace. If no value is provided, see the beginning of this article for the behavior at run time.
- **Keyspace** (_required_): The name of the keyspace in the database.
- **Batch Size**: The maximum number of records per batch. The default is `20` if not otherwise specified.
- **Flatten Metadata**: Check this box to flatten the metadata into each record.
76 changes: 60 additions & 16 deletions snippets/general-shared-text/astradb.mdx
@@ -8,26 +8,70 @@ allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; pic
allowfullscreen
></iframe>

- An Astra account. [Create or sign in to an Astra account](https://astra.datastax.com/).
- A database in the Astra account. [Create a database in an account](https://docs.datastax.com/en/astra-db-classic/databases/manage-create.html).
- An application token for the database. [Create a database application token](https://docs.datastax.com/en/astra-db-serverless/administration/manage-application-tokens.html).
- A namespace in the database. [Create a namespace in a database](https://docs.datastax.com/en/astra-db-serverless/databases/manage-namespaces.html#create-namespace).
- A collection in the namespace. [Create a collection in a namespace](https://docs.datastax.com/en/astra-db-serverless/databases/manage-collections.html#create-collection).
- An IBM Cloud account or DataStax account.

An existing collection is not required. At runtime, the collection behavior is as follows:
- For an IBM Cloud account, [sign up](https://cloud.ibm.com/registration) for an IBMid, and then [sign in](https://accounts.datastax.com/session-service/v1/login) to DataStax with your IBMid.
- For a DataStax account, [sign up](https://astra.datastax.com/signup) for a DataStax account, and then [sign in](https://accounts.datastax.com/session-service/v1/login) to DataStax with your DataStax account.

- An Astra DB database in the DataStax account. To create a database:

a. After you sign in to DataStax, click **Create database**.<br/>
b. Click the **Serverless (vector)** tile, if it is not already selected.<br/>
c. For **Database name**, enter a unique name for the database.<br/>
d. Select a **Provider** and a **Region**, and then click **Create database**.<br/>

[Learn more](https://docs.datastax.com/en/astra-db-classic/databases/manage-create.html).

- An application token for the database. To create an application token:

a. After you sign in to DataStax, in the list of databases, click the name of the target database.<br/>
b. On the **Overview** tab, under **Database Details**, in the **Application Tokens** tile, click **Generate Token**.<br/>
c. Enter some **Token description** and select an **Expiration** time period, and then click **Generate token**.<br/>
d. Save the application token that is displayed to a secure location, and then click **Close**.<br/>

[Learn more](https://docs.datastax.com/en/astra-db-serverless/administration/manage-application-tokens.html).

- A keyspace in the database. To create a keyspace:

a. After you sign in to DataStax, in the list of databases, click the name of the target database.<br/>
b. On the **Data Explorer** tab, in the **Keyspace** list, select **Create keyspace**.<br/>
c. Enter some **Keyspace name**, and then click **Add keyspace**.<br/>

[Learn more](https://docs.datastax.com/en/astra-db-serverless/databases/manage-keyspaces.html#keyspaces).

- A collection in the keyspace.

For the [Unstructured UI](/ui/overview) and [Unstructured API](/api-reference/overview):

- If an existing collection name is specified, and Unstructured generates embeddings,
but the number of dimensions that are generated does not match the existing collection's embedding settings, the run will fail.
You must change your Unstructured embedding settings or your existing collection's embedding settings to match, and try the run again.
- If a collection name is not specified, Unstructured creates a new collection in your namespace. If Unstructured generates embeddings,
the new collection's name will be `u<short-workflow-id>_<short-embedding-model-name>_<number-of-dimensions>`.
If Unstructured does not generate embeddings, the new collection's name will be `u<short-workflow-id>`.
- An existing collection is not required. At runtime, the collection behavior is as follows:

- If an existing collection name is specified, and Unstructured generates embeddings,
but the number of dimensions that are generated does not match the existing collection's embedding settings, the run will fail.
You must change your Unstructured embedding settings or your existing collection's embedding settings to match, and try the run again.
- If a collection name is not specified, Unstructured creates a new collection in your keyspace. If Unstructured generates embeddings,
the new collection's name will be `u<short-workflow-id>_<short-embedding-model-name>_<number-of-dimensions>`.
If Unstructured does not generate embeddings, the new collection's name will be `u<short-workflow-id>`.

For [Unstructured Ingest](/open-source/ingestion/overview):

- If an existing collection name is specified, and Unstructured generates embeddings,
but the number of dimensions that are generated does not match the existing collection's embedding settings, the run will fail.
You must change your Unstructured embedding settings or your existing collection's embedding settings to match, and try the run again.
- If a collection name is not specified, Unstructured creates a new collection in your Pinecone account. The new collection's name will be `unstructuredautocreated`.
- For the source connector only, an existing collection is required.
- For the destination connector only, an existing collection is not required. At runtime, the collection behavior is as follows:

- If an existing collection name is specified, and Unstructured generates embeddings,
but the number of dimensions that are generated does not match the existing collection's embedding settings, the run will fail.
You must change your Unstructured embedding settings or your existing collection's embedding settings to match, and try the run again.
- If a collection name is not specified, Unstructured creates a new collection in your keyspace. The new collection's name will be `unstructuredautocreated`.

To create a collection yourself:

a. After you sign in to DataStax, in the list of databases, click the name of the target database.<br/>
b. On the **Data Explorer** tab, in the **Keyspace** list, select the name of the target keyspace.<br/>
c. In the **Collections** list, select **Create collection**.<br/>
d. Enter some **Collection name**.<br/>
e. Turn on **Vector-enabled collection**, if it is not already turned on.<br/>
f. For **Embedding generation method**, select **Bring my own**.<br/>
g. For **Dimensions**, enter the number of dimensions for the embedding model that you plan to use.<br/>
h. For **Similarity metric**, select **Cosine**.<br/>
i. Click **Create collection**.<br/>

[Learn more](https://docs.datastax.com/en/astra-db-serverless/databases/manage-collections.html#create-collection).
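
The run-time collection behavior described above can be illustrated with a short Python sketch. The function names are hypothetical, and how Unstructured derives the short workflow ID and short embedding model name is not stated here, so both are taken as plain inputs.

```python
from typing import Optional

# Name used by Unstructured Ingest when no collection name is specified.
INGEST_AUTOCREATED_NAME = "unstructuredautocreated"


def auto_collection_name(
    short_workflow_id: str,
    short_model_name: Optional[str] = None,
    dimensions: Optional[int] = None,
) -> str:
    """Build the UI/API auto-created collection name from already-shortened inputs."""
    if short_model_name is None or dimensions is None:
        # No embeddings generated: only the workflow ID is used.
        return f"u{short_workflow_id}"
    return f"u{short_workflow_id}_{short_model_name}_{dimensions}"


def dimensions_match(generated_dimensions: int, collection_dimensions: int) -> bool:
    """A run against an existing collection fails when these two values differ."""
    return generated_dimensions == collection_dimensions
```

A check like `dimensions_match` before starting a run is one way to catch a mismatch early, instead of changing embedding settings after a failed run.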
87 changes: 59 additions & 28 deletions snippets/general-shared-text/milvus.mdx
@@ -1,7 +1,61 @@
- For the [Unstructured UI](/ui/overview) or the [Unstructured API](/api-reference/overview), only Milvus cloud-based instances (such as Zilliz Cloud, and Milvus on IBM watsonx.data) are supported.
- For the [Unstructured UI](/ui/overview) or the [Unstructured API](/api-reference/overview), only Milvus cloud-based instances (such as Milvus on IBM watsonx.data, or Zilliz Cloud) are supported.
- For [Unstructured Ingest](/open-source/ingestion/overview), Milvus local and cloud-based instances are supported.

The following video shows how to fulfill the minimum set of requirements for Milvus cloud-based instances, demonstrating Milvus on IBM watsonx.data:
- For Milvus on IBM watsonx.data, you will need:

<iframe
width="560"
height="315"
src="https://www.youtube.com/embed/hLCwoe2fCnc"
title="YouTube video player"
frameborder="0"
allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
allowfullscreen
></iframe>

- An [IBM Cloud account](https://cloud.ibm.com/registration).
- An IBM watsonx.data [Lite plan](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-tutorial_prov_lite_1)
or [Enterprise plan](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-getting-started_1) within your IBM Cloud account.

- If you are provisioning a Lite plan, be sure to choose the **Generative AI** use case when prompted, as this is the only use case offered that includes Milvus.

- A [Milvus service instance in IBM watsonx.data](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-adding-milvus-service).

- If you are creating a Milvus service instance within a watsonx.data Lite plan, when you are prompted to choose a Milvus instance size, you can only select **Lite**. Because the Lite
Milvus instance size is recommended only for 384 dimensions, you should also use an embedding model that uses 384 dimensions only.
- If you are creating a Milvus service instance within a watsonx.data Enterprise plan, you can choose any available Milvus instance size. However, all Milvus instance sizes other than
**Custom** are recommended only for 384 dimensions, which means you should use an embedding model that uses 384 dimensions only.
The **Custom** Milvus instance size is recommended for any number of dimensions.

- The URI of the instance, which takes the format of `https://`, followed by the instance's **GRPC host**, followed by a colon and the **GRPC port**.
That is, the URI takes the format of `https://<host>:<port>`. To get this information, do the following:

a. Sign in to your IBM Cloud account.<br/>
b. On the sidebar, click the **Resource list** icon. If the sidebar is not visible, click the **Navigation Menu** icon to the far left of the title bar.<br/>
c. Expand **Databases**, and then click the name of the target **watsonx.data** plan.<br/>
d. Click **Open web console**.<br/>
e. On the sidebar, click **Infrastructure manager**. If the sidebar is not visible, click the **Global navigation** icon to the far left of the title bar.<br/>
f. Click the target Milvus service instance.<br/>
g. On the **Details** tab, under **Type**, click **View connect details**.<br/>
h. Under **Service details**, expand **GRPC**, and note the value of **GRPC host** and **GRPC port**.<br/>

- The name of the [database](https://milvus.io/docs/manage_databases.md) in the instance.
- The name of the [collection](https://milvus.io/docs/manage-collections.md) in the database. Note the collection requirements at the end of this section.
- The username and password to access the instance.

- The username for Milvus on IBM watsonx.data is always `ibmlhapikey`.
- The password for Milvus on IBM watsonx.data is in the form of an IBM Cloud user API key. To create an IBM Cloud user API key:

a. Sign in to your IBM Cloud account.<br/>
b. In the title bar, click **Manage** and then, under **Security and access**, click **Access (IAM)**.<br/>
c. On the sidebar, under **Manage identities**, click **API keys**. If the sidebar is not visible, click the **Navigation Menu** icon to the far left of the title bar.<br/>
d. Click **Create**.<br/>
e. Enter some **Name** for the API key.<br/>
f. Optionally, enter some **Description** for the API key.<br/>
g. For **Leaked action**, leave **Disable the leaked key** selected.<br/>
h. For **Session management**, leave **No** selected.<br/>
i. Click **Create**.<br/>
j. Click **Download** (or **Copy**), and then download the API key to a secure location (or paste the copied API key into a secure location). You won't be able to access this API key from this dialog again. If you lose this API key, you can create a new one (and you should then delete the old one).<br/>

- For Zilliz Cloud, you will need:

@@ -54,31 +108,6 @@ The following video shows how to fulfill the minimum set of requirements for Mil
The number of dimensions for the `embeddings` field must match the number of dimensions for the embedding model that you plan to use.
</Warning>

- For Milvus on IBM watsonx.data, you will need:

<iframe
width="560"
height="315"
src="https://www.youtube.com/embed/hLCwoe2fCnc"
title="YouTube video player"
frameborder="0"
allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
allowfullscreen
></iframe>

- An [IBM Cloud account](https://cloud.ibm.com/registration).
- The [IBM watsonx.data subscription plan](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-getting-started).
- A [Milvus service instance in IBM watsonx.data](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-adding-milvus-service).
- The URI of the instance, which takes the format of `https://`, followed by the instance's **GRPC host**, followed by a colon and the **GRPC port**.
This takes the format of `https://<host>:<port>`.
[Get the instance's GRPC host and GRPC port](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-conn-to-milvus).
- The name of the [database](https://milvus.io/docs/manage_databases.md) in the instance.
- The name of the [collection](https://milvus.io/docs/manage-collections.md) in the database. Note the collection requirements at the end of this section.
- The username and password to access the instance.
The username for Milvus on IBM watsonx.data is always `ibmlhapikey`.
The password for Milvus on IBM watsonx.data is in the form of an IBM Cloud user API key.
[Get the user API key](https://cloud.ibm.com/docs/account?topic=account-userapikey&interface=ui).

- For Milvus local, you will need:

- A [Milvus instance](https://milvus.io/docs/install-overview.md).
@@ -89,7 +118,9 @@ The following video shows how to fulfill the minimum set of requirements for Mil
- The [username and password, or token](https://milvus.io/docs/authenticate.md) to access the instance.
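
For Milvus on IBM watsonx.data, the connection values listed earlier in this section (GRPC host, GRPC port, the fixed username, and the IBM Cloud user API key) can be assembled as in the following Python sketch. The helper names are hypothetical; the `uri`, `user`, `password`, and `db_name` keys are intended to line up with the keyword arguments of `MilvusClient` in the Python SDK for Milvus, which you should confirm against your installed version.

```python
# The username for Milvus on IBM watsonx.data is always this value.
WATSONX_MILVUS_USERNAME = "ibmlhapikey"


def milvus_uri(grpc_host: str, grpc_port: int) -> str:
    """Build https://<host>:<port> from the instance's GRPC host and GRPC port."""
    return f"https://{grpc_host}:{grpc_port}"


def watsonx_milvus_client_args(
    grpc_host: str, grpc_port: int, api_key: str, db_name: str
) -> dict:
    """Collect connection arguments; the password is the IBM Cloud user API key."""
    return {
        "uri": milvus_uri(grpc_host, grpc_port),
        "user": WATSONX_MILVUS_USERNAME,
        "password": api_key,
        "db_name": db_name,
    }
```

Under that assumption, the resulting dictionary could be unpacked into the client constructor, for example `MilvusClient(**watsonx_milvus_client_args(host, port, api_key, "default"))`. Read the API key from an environment variable or a secrets manager instead of hard-coding it.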

All Milvus instances require the target collection to have a defined schema before Unstructured can write to the collection. The minimum viable
schema for Unstructured contains only the fields `element_id`, `embeddings`, `record_id`, and `text`, as follows. This example code demonstrates the use of the
schema for Unstructured contains only the fields `element_id`, `embeddings`, `record_id`, and `text`, as follows.

This example code demonstrates the use of the
[Python SDK for Milvus](https://pypi.org/project/pymilvus/) to create a collection with this schema,
targeting Milvus on IBM watsonx.data. For the `MilvusClient` arguments to connect to other types of Milvus deployments, see your Milvus provider's documentation:

Expand Down
11 changes: 11 additions & 0 deletions ui/destinations/ibm-watsonxdata.mdx
@@ -2,6 +2,17 @@
title: IBM watsonx.data
---

<Tip>
The IBM watsonx.data destination connector relies on an Apache Iceberg-based catalog within the watsonx.data data store instance.
Apache Iceberg is suitable for managed data storage and cataloging, but not for embedding storage or semantic similarity
queries. For embedding storage and semantic similarity queries, Unstructured recommends that you use the following destination connectors
instead:

- [Astra DB](/ui/destinations/astradb)
- [Milvus](/ui/destinations/milvus) on IBM watsonx.data

</Tip>

import FirstTimeUIDestinationConnector from '/snippets/general-shared-text/first-time-ui-destination-connector.mdx';

<FirstTimeUIDestinationConnector />