Commit 4bee119

doc(nbs): tidy up preproc commentary
1 parent c1e54db commit 4bee119

1 file changed

+4
-4
lines changed

notebooks/1. Data Preparation.ipynb

Lines changed: 4 additions & 4 deletions
@@ -707,11 +707,11 @@
 "source": [
 "Next, we need to **define the inputs** for the processing job.\n",
 "\n",
-"To process the whole `data/raw` corpus, you could simply pass the whole `data/raw` prefix in S3 as input to the job (As shown in the commented-out *Option 2* below) and scale up the `instance_count` to complete the work quickly.\n",
+"To process the whole `data/raw` corpus, you could simply pass the whole `data/raw` prefix in S3 as input to the job (As shown in the commented-out *Option 2* below) and scale up the job's compute resources to complete the work quickly.\n",
 "\n",
 "To process just a sample subset of files for speed in our demo, we'll create a **manifest file** listing just the documents we want.\n",
 "\n",
-"> ⚠️ **Note:** 'Non-augmented' manifest files are still JSON-based, but a different format from the other dataset manifests we'll be using through this sample. You can find guidance for manifests as used here on the [S3DataSource API doc](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_S3DataSource.html), and separate information on the \"augmented\" manifests as used later with SageMaker Ground Truth in the [Ground Truth documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-input-data-input-manifest.html)."
+"> ⚠️ **Note:** 'Non-augmented' manifest files are still JSON-based, but a different format from the other dataset manifests we'll be using through this sample. You can find guidance on the [S3DataSource API doc](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_S3DataSource.html) for manifests as used here, and separate information in the [Ground Truth documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-input-data-input-manifest.html) on the \"augmented\" manifests as used later with SageMaker Ground Truth."
 ]
 },
 {
@@ -768,9 +768,9 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"The script we'll be using to process the documents is in the same folder as the Dockerfile used earlier to build the container image: [preproc/imgclean.py](preproc/imageclean.py).\n",
+"The script we'll be using to process the documents is in the same folder as the Dockerfile used earlier to build the container image: [preproc/imgclean.py](preproc/imgclean.py).\n",
 "\n",
-"The code parallelizes processing across available CPUs, and the `ShardedByS3Key` setting used on our [ProcessingInput](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.ProcessingInput) above distributes documents between instances if muliple are provided - so you should be able to `instance_type` and `instance_count` of the job if needed to take advantage of what resources you have available. The process is typically CPU-bound, so the `ml.c*` families are likely a good fit.\n",
+"The code parallelizes processing across available CPUs, and the `ShardedByS3Key` setting used on our [ProcessingInput](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.ProcessingInput) above distributes documents between instances if multiple are provided. This means you should be able to adjust both `instance_type` and `instance_count` below, and still take advantage of the resources configured. The process is typically CPU-bound, so the `ml.c*` families are likely a good fit.\n",
 "\n",
 "The cell below will **run the processing job** and show logs from the job as it progresses. You can also check up on the status and history of jobs in the [Processing page of the Amazon SageMaker Console](https://console.aws.amazon.com/sagemaker/home?#/processing-jobs).\n",
 "\n",
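
For readers unfamiliar with the 'non-augmented' manifest format the changed commentary refers to, here is a minimal sketch of how such a file might be assembled. It is not taken from the notebook: the `build_manifest` helper, the bucket name, and the document keys are all hypothetical. The only part drawn from the S3DataSource API documentation is the layout itself, a JSON array whose first element is a `prefix` object followed by keys relative to that prefix.

```python
import json


def build_manifest(s3_prefix, relative_keys):
    """Return a non-augmented manifest as a JSON-serializable list.

    Hypothetical helper for illustration. The layout follows the
    S3DataSource 'ManifestFile' format: a {"prefix": ...} object first,
    then object keys given relative to that prefix.
    """
    # Make sure the prefix ends with "/" so prefix + key forms a full URI.
    if not s3_prefix.endswith("/"):
        s3_prefix += "/"
    return [{"prefix": s3_prefix}] + list(relative_keys)


# Assumed example values: the bucket and file names are placeholders.
manifest = build_manifest(
    "s3://example-bucket/data/raw",
    ["doc-a.pdf", "subfolder/doc-b.png"],
)
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

Under these assumptions, the resulting JSON would be uploaded to S3 and its URI passed as the `source` of a `ProcessingInput` with `s3_data_type="ManifestFile"`, alongside the `ShardedByS3Key` distribution setting the notebook text mentions.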
