diff --git a/api-reference/workflow/destinations/ibm-watsonxdata.mdx b/api-reference/workflow/destinations/ibm-watsonxdata.mdx index aa9cf952..c9905fc2 100644 --- a/api-reference/workflow/destinations/ibm-watsonxdata.mdx +++ b/api-reference/workflow/destinations/ibm-watsonxdata.mdx @@ -2,6 +2,17 @@ title: IBM watsonx.data --- + + The IBM watsonx.data destination connector relies on an Apache Iceberg-based catalog within the watsonx.data data store instance. + Apache Iceberg is suitable for managed data storage and cataloging, but not for embedding storage or semantic similarity + queries. For embedding storage and semantic similarity queries, Unstructured recommends that you use the following destination connectors + instead: + + - [Astra DB](/api-reference/workflow/destinations/astradb) + - [Milvus](/api-reference/workflow/destinations/milvus) on IBM watsonx.data + + + import FirstTimeAPIDestinationConnector from '/snippets/general-shared-text/first-time-api-destination-connector.mdx'; diff --git a/open-source/ingestion/destination-connectors/ibm-watsonxdata.mdx b/open-source/ingestion/destination-connectors/ibm-watsonxdata.mdx index 2135acf6..9b97c33e 100644 --- a/open-source/ingestion/destination-connectors/ibm-watsonxdata.mdx +++ b/open-source/ingestion/destination-connectors/ibm-watsonxdata.mdx @@ -2,6 +2,17 @@ title: IBM watsonx.data --- + + The IBM watsonx.data destination connector relies on an Apache Iceberg-based catalog within the watsonx.data data store instance. + Apache Iceberg is suitable for managed data storage and cataloging, but not for embedding storage or semantic similarity + queries. 
For embedding storage and semantic similarity queries, Unstructured recommends that you use the following destination connectors + instead: + + - [Astra DB](/open-source/ingestion/destination-connectors/astradb) + - [Milvus](/open-source/ingestion/destination-connectors/milvus) on IBM watsonx.data + + + import SharedIBMWatsonxdata from '/snippets/dc-shared-text/ibm-watsonxdata-cli-api.mdx'; diff --git a/snippets/general-shared-text/astradb-api-placeholders.mdx b/snippets/general-shared-text/astradb-api-placeholders.mdx index 0368ae40..45697f5e 100644 --- a/snippets/general-shared-text/astradb-api-placeholders.mdx +++ b/snippets/general-shared-text/astradb-api-placeholders.mdx @@ -1,7 +1,7 @@ - `` (_required_) - A unique name for this connector. - `` (_required_) - The application token for the database. -- `` (_required_) - The database’s associated API endpoint. -- `` - The name of the collection in the namespace. If no value is provided, see the beginning of this article for the behavior at run time. +- `` (_required_) - The database's associated API endpoint. +- `` - The name of the collection in the keyspace. If no value is provided, see the beginning of this article for the behavior at run time. - `` - The name of the keyspace in the collection. The default is `default_keyspace` if not otherwise specified. - `` - The maximum number of records to send per batch. The default is `20` if not otherwise specified. - `flatten_metadata` - Set to `true` to flatten the metadata into each record. Specifically, when flattened, the metadata key values are brought to the top level of the element, and the `metadata` key itself is removed. By default, the metadata is not flattened (`false`). 
diff --git a/snippets/general-shared-text/astradb-platform.mdx b/snippets/general-shared-text/astradb-platform.mdx index 3e4a79fd..8ff44f9b 100644 --- a/snippets/general-shared-text/astradb-platform.mdx +++ b/snippets/general-shared-text/astradb-platform.mdx @@ -1,7 +1,7 @@ Fill in the following fields: - **Name** (_required_): A unique name for this connector. -- **Collection Name**: The name of the collection in the namespace. If no value is provided, see the beginning of this article for the behavior at run time. +- **Collection Name**: The name of the collection in the keyspace. If no value is provided, see the beginning of this article for the behavior at run time. - **Keyspace** (_required_): The name of the keyspace in the collection. - **Batch Size**: The maximum number of records per batch. The default is `20` if not otherwise specified. - **Flatten Metadata**: Check this box to flatten the metadata into each record. diff --git a/snippets/general-shared-text/astradb.mdx b/snippets/general-shared-text/astradb.mdx index b2a8c1ec..91d4b3c8 100644 --- a/snippets/general-shared-text/astradb.mdx +++ b/snippets/general-shared-text/astradb.mdx @@ -8,26 +8,70 @@ allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; pic allowfullscreen > -- An Astra account. [Create or sign in to an Astra account](https://astra.datastax.com/). -- A database in the Astra account. [Create a database in an account](https://docs.datastax.com/en/astra-db-classic/databases/manage-create.html). -- An application token for the database. [Create a database application token](https://docs.datastax.com/en/astra-db-serverless/administration/manage-application-tokens.html). -- A namespace in the database. [Create a namespace in a database](https://docs.datastax.com/en/astra-db-serverless/databases/manage-namespaces.html#create-namespace). -- A collection in the namespace. 
[Create a collection in a namespace](https://docs.datastax.com/en/astra-db-serverless/databases/manage-collections.html#create-collection). +- An IBM Cloud account or DataStax account. - An existing collection is not required. At runtime, the collection behavior is as follows: + - For an IBM Cloud account, [sign up](https://cloud.ibm.com/registration) for an IBMid, and then [sign in](https://accounts.datastax.com/session-service/v1/login) to DataStax with your IBMid. + - For a DataStax account, [sign up](https://astra.datastax.com/signup) for a DataStax account, and then [sign in](https://accounts.datastax.com/session-service/v1/login) to DataStax with your DataStax account. + +- An Astra DB database in the DataStax account. To create a database: + + a. After you sign in to DataStax, click **Create database**.
+ b. Click the **Serverless (vector)** tile, if it is not already selected.
+ c. For **Database name**, enter some unique name for the database.
+ d. Select a **Provider** and a **Region**, and then click **Create database**.
+ + [Learn more](https://docs.datastax.com/en/astra-db-classic/databases/manage-create.html). + +- An application token for the database. To create an application token: + + a. After you sign in to DataStax, in the list of databases, click the name of the target database.
+ b. On the **Overview** tab, under **Database Details**, in the **Application Tokens** tile, click **Generate Token**.
+ c. Enter some **Token description** and select an **Expiration** time period, and then click **Generate token**.
+ d. Save the application token that is displayed to a secure location, and then click **Close**.
+ + [Learn more](https://docs.datastax.com/en/astra-db-serverless/administration/manage-application-tokens.html). + +- A keyspace in the database. To create a keyspace: + + a. After you sign in to DataStax, in the list of databases, click the name of the target database.
+ b. On the **Data Explorer** tab, in the **Keyspace** list, select **Create keyspace**.
+ c. Enter some **Keyspace name**, and then click **Add keyspace**.
+ + [Learn more](https://docs.datastax.com/en/astra-db-serverless/databases/manage-keyspaces.html#keyspaces). + +- A collection in the keyspace. For the [Unstructured UI](/ui/overview) and [Unstructured API](/api-reference/overview): - - If an existing collection name is specified, and Unstructured generates embeddings, - but the number of dimensions that are generated does not match the existing collection's embedding settings, the run will fail. - You must change your Unstructured embedding settings or your existing collection's embedding settings to match, and try the run again. - - If a collection name is not specified, Unstructured creates a new collection in your namespace. If Unstructured generates embeddings, - the new collections's name will be `u__`. - If Unstructured does not generate embeddings, the new collections's name will be `u__`. + If Unstructured does not generate embeddings, the new collections's name will be `u + b. On the **Data Explorer** tab, in the **Keyspace** list, select the name of the target keyspace.
+ c. In the **Collections** list, select **Create collection**.
+ d. Enter some **Collection name**.
+ e. Turn on **Vector-enabled collection**, if it is not already turned on.
+ f. For **Embedding generation method**, select **Bring my own**.
+ g. For **Dimensions**, enter the number of dimensions for the embedding model that you plan to use.
+ h. For **Similarity metric**, select **Cosine**.
+ i. Click **Create collection**.
+ + [Learn more](https://docs.datastax.com/en/astra-db-serverless/databases/manage-collections.html#create-collection). \ No newline at end of file diff --git a/snippets/general-shared-text/milvus.mdx b/snippets/general-shared-text/milvus.mdx index 52055b36..6db481cc 100644 --- a/snippets/general-shared-text/milvus.mdx +++ b/snippets/general-shared-text/milvus.mdx @@ -1,7 +1,61 @@ -- For the [Unstructured UI](/ui/overview) or the [Unstructured API](/api-reference/overview), only Milvus cloud-based instances (such as Zilliz Cloud, and Milvus on IBM watsonx.data) are supported. +- For the [Unstructured UI](/ui/overview) or the [Unstructured API](/api-reference/overview), only Milvus cloud-based instances (such as Milvus on IBM watsonx.data, or Zilliz Cloud) are supported. - For [Unstructured Ingest](/open-source/ingestion/overview), Milvus local and cloud-based instances are supported. -The following video shows how to fulfill the minimum set of requirements for Milvus cloud-based instances, demonstrating Milvus on IBM watsonx.data: +- For Milvus on IBM watsonx.data, you will need: + + + + - An [IBM Cloud account](https://cloud.ibm.com/registration). + - An IBM watsonx.data [Lite plan](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-tutorial_prov_lite_1) + or [Enterprise plan](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-getting-started_1) within your IBM Cloud account. + + - If you are provisioning a Lite plan, be sure to choose the **Generative AI** use case when prompted, as this is the only use case offered that includes Milvus. + + - A [Milvus service instance in IBM watsonx.data](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-adding-milvus-service). + + - If you are creating a Milvus service instance within a watsonx.data Lite plan, when you are prompted to choose a Milvus instance size, you can only select **Lite**. 
Because the Lite + Milvus instance size is recommended only for 384 dimensions, you should also use an embedding model that uses 384 dimensions only. + - If you are creating a Milvus service instance within a watsonx.data Enterprise plan, you can choose any available Milvus instance size. However, all Milvus instance sizes other than + **Custom** are recommended only for 384 dimensions, which means you should use an embedding model that uses 384 dimensions only. + The **Custom** Milvus instance size is recommended for any number of dimensions. + + - The URI of the instance, which takes the format of `https://`, followed by the instance's **GRPC host**, followed by a colon and the **GRPC port**. + This takes the format of `https://:`. To get this information, do the following: + + a. Sign in to your IBM Cloud account.
+ b. On the sidebar, click the **Resource list** icon. If the sidebar is not visible, click the **Navigation Menu** icon to the far left of the title bar.
+ c. Expand **Databases**, and then click the name of the target **watsonx.data** plan.
+ d. Click **Open web console**.
+ e. On the sidebar, click **Infrastructure manager**. If the sidebar is not visible, click the **Global navigation** icon to the far left of the title bar.
+ f. Click the target Milvus service instance.
+ g. On the **Details** tab, under **Type**, click **View connect details**.
+ h. Under **Service details**, expand **GRPC**, and note the value of **GRPC host** and **GRPC port**.
+ + - The name of the [database](https://milvus.io/docs/manage_databases.md) in the instance. + - The name of the [collection](https://milvus.io/docs/manage-collections.md) in the database. Note the collection requirements at the end of this section. + - The username and password to access the instance. + + - The username for Milvus on IBM watsonx.data is always `ibmlhapikey`. + - The password for Milvus on IBM watsonx.data is in the form of an IBM Cloud user API key. To create an IBM Cloud user API key: + + a. Sign in to your IBM Cloud account.
+ b. In the title bar, click **Manage** and then, under **Security and access**, click **Access (IAM)**.
+ c. On the sidebar, under **Manage identities**, click **API keys**. If the sidebar is not visible, click the **Navigation Menu** icon to the far left of the title bar.
+ d. Click **Create**.
+ e. Enter some **Name** for the API key.
+ f. Optionally, enter some **Description** for the API key.
+ g. For **Leaked action**, leave **Disable the leaked key** selected.
+ h. For **Session management**, leave **No** selected.
+ i. Click **Create**.
+ j. Click **Download** (or **Copy**), and then download the API key to a secure location (or paste the copied API key into a secure location). You won't be able to access this API key from this dialog again. If you lose this API key, you can create a new one (and you should then delete the old one).
- For Zilliz Cloud, you will need: @@ -54,31 +108,6 @@ The following video shows how to fulfill the minimum set of requirements for Mil The number of dimensions for the `embeddings` field must match the number of dimensions for the embedding model that you plan to use. -- For Milvus on IBM watsonx.data, you will need: - - - - - An [IBM Cloud account](https://cloud.ibm.com/registration). - - The [IBM watsonx.data subscription plan](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-getting-started). - - A [Milvus service instance in IBM watsonx.data](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-adding-milvus-service). - - The URI of the instance, which takes the format of `https://`, followed by instance's **GRPC host**, followed by a colon and the **GRPC port**. - This takes the format of `https://:`. - [Get the instance's GRPC host and GRPC port](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-conn-to-milvus). - - The name of the [database](https://milvus.io/docs/manage_databases.md) in the instance. - - The name of the [collection](https://milvus.io/docs/manage-collections.md) in the database. Note the collection requirements at the end of this section. - - The username and password to access the instance. - The username for Milvus on IBM watsonx.data is always `ibmlhapikey`. - The password for Milvus on IBM watsonx.data is in the form of an IBM Cloud user API key. - [Get the user API key](https://cloud.ibm.com/docs/account?topic=account-userapikey&interface=ui). - - For Milvus local, you will need: - A [Milvus instance](https://milvus.io/docs/install-overview.md). @@ -89,7 +118,9 @@ The following video shows how to fulfill the minimum set of requirements for Mil - The [username and password, or token](https://milvus.io/docs/authenticate.md) to access the instance. All Milvus instances require the target collection to have a defined schema before Unstructured can write to the collection. 
The minimum viable -schema for Unstructured contains only the fields `element_id`, `embeddings`, `record_id`, and `text`, as follows. This example code demonstrates the use of the +schema for Unstructured contains only the fields `element_id`, `embeddings`, `record_id`, and `text`, as follows. + +This example code demonstrates the use of the [Python SDK for Milvus](https://pypi.org/project/pymilvus/) to create a collection with this schema, targeting Milvus on IBM watsonx.data. For the `MilvusClient` arguments to connect to other types of Milvus deployments, see your Milvus provider's documentation: diff --git a/ui/destinations/ibm-watsonxdata.mdx b/ui/destinations/ibm-watsonxdata.mdx index a2d128a1..667cb8f0 100644 --- a/ui/destinations/ibm-watsonxdata.mdx +++ b/ui/destinations/ibm-watsonxdata.mdx @@ -2,6 +2,17 @@ title: IBM watsonx.data --- + + The IBM watsonx.data destination connector relies on an Apache Iceberg-based catalog within the watsonx.data data store instance. + Apache Iceberg is suitable for managed data storage and cataloging, but not for embedding storage or semantic similarity + queries. For embedding storage and semantic similarity queries, Unstructured recommends that you use the following destination connectors + instead: + + - [Astra DB](/ui/destinations/astradb) + - [Milvus](/ui/destinations/milvus) on IBM watsonx.data + + + import FirstTimeUIDestinationConnector from '/snippets/general-shared-text/first-time-ui-destination-connector.mdx';