|
707 | 707 | "source": [
|
708 | 708 | "Next, we need to **define the inputs** for the processing job.\n",
|
709 | 709 | "\n",
|
710 |
| - "To process the whole `data/raw` corpus, you could simply pass the whole `data/raw` prefix in S3 as input to the job (As shown in the commented-out *Option 2* below) and scale up the `instance_count` to complete the work quickly.\n", |
| 710 | + "To process the whole `data/raw` corpus, you could simply pass the whole `data/raw` prefix in S3 as input to the job (As shown in the commented-out *Option 2* below) and scale up the job's compute resources to complete the work quickly.\n", |
711 | 711 | "\n",
|
712 | 712 | "To process just a sample subset of files for speed in our demo, we'll create a **manifest file** listing just the documents we want.\n",
|
713 | 713 | "\n",
|
714 |
| - "> ⚠️ **Note:** 'Non-augmented' manifest files are still JSON-based, but a different format from the other dataset manifests we'll be using through this sample. You can find guidance for manifests as used here on the [S3DataSource API doc](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_S3DataSource.html), and separate information on the \"augmented\" manifests as used later with SageMaker Ground Truth in the [Ground Truth documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-input-data-input-manifest.html)." |
| 714 | + "> ⚠️ **Note:** 'Non-augmented' manifest files are still JSON-based, but a different format from the other dataset manifests we'll be using through this sample. You can find guidance on the [S3DataSource API doc](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_S3DataSource.html) for manifests as used here, and separate information in the [Ground Truth documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-input-data-input-manifest.html) on the \"augmented\" manifests as used later with SageMaker Ground Truth." |
715 | 715 | ]
|
716 | 716 | },
|
717 | 717 | {
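The note above points at the non-augmented manifest format documented for `S3DataSource`: a JSON array whose first element is a `prefix` object and whose remaining elements are object keys relative to that prefix. As a minimal sketch of building and uploading one with `boto3` (the bucket name, document keys, and manifest path here are illustrative, not taken from the notebook):

```python
import json

import boto3

# Illustrative names only - substitute your own bucket, keys, and manifest path:
bucket = "example-bucket"
manifest_key = "data/manifests/demo.manifest.json"

# Non-augmented manifest format (see the S3DataSource API doc): the first
# element sets a common S3 prefix; every following element is an object key
# relative to that prefix.
manifest = [
    {"prefix": f"s3://{bucket}/data/raw/"},
    "doc-0001.pdf",
    "doc-0002.pdf",
]

boto3.resource("s3").Object(bucket, manifest_key).put(
    Body=json.dumps(manifest).encode("utf-8")
)
```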
|
|
768 | 768 | "cell_type": "markdown",
|
769 | 769 | "metadata": {},
|
770 | 770 | "source": [
|
771 |
| - "The script we'll be using to process the documents is in the same folder as the Dockerfile used earlier to build the container image: [preproc/imgclean.py](preproc/imageclean.py).\n", |
| 771 | + "The script we'll be using to process the documents is in the same folder as the Dockerfile used earlier to build the container image: [preproc/imgclean.py](preproc/imgclean.py).\n", |
772 | 772 | "\n",
|
773 |
| - "The code parallelizes processing across available CPUs, and the `ShardedByS3Key` setting used on our [ProcessingInput](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.ProcessingInput) above distributes documents between instances if muliple are provided - so you should be able to `instance_type` and `instance_count` of the job if needed to take advantage of what resources you have available. The process is typically CPU-bound, so the `ml.c*` families are likely a good fit.\n", |
| 773 | + "The code parallelizes processing across available CPUs, and the `ShardedByS3Key` setting used on our [ProcessingInput](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.ProcessingInput) above distributes documents between instances if multiple are provided. This means you should be able to adjust both `instance_type` and `instance_count` below, and still take advantage of the resources configured. The process is typically CPU-bound, so the `ml.c*` families are likely a good fit.\n", |
774 | 774 | "\n",
|
775 | 775 | "The cell below will **run the processing job** and show logs from the job as it progresses. You can also check up on the status and history of jobs in the [Processing page of the Amazon SageMaker Console](https://console.aws.amazon.com/sagemaker/home?#/processing-jobs).\n",
|
776 | 776 | "\n",
|
|