|
707 | 707 | "source": [
|
708 | 708 | "Next, we need to **define the inputs** for the processing job.\n",
|
709 | 709 | "\n",
|
710 |
| - "To process the whole `data/raw` corpus, you could simply pass the whole `data/raw` prefix in S3 as input to the job (As shown in the commented-out *Option 2* below) and scale up the `instance_count` to complete the work quickly.\n", |
| 710 | + "To process the whole `data/raw` corpus, you could simply pass the whole `data/raw` prefix in S3 as input to the job (As shown in the commented-out *Option 2* below) and scale up the job's compute resources to complete the work quickly.\n", |
711 | 711 | "\n",
|
712 | 712 | "To process just a sample subset of files for speed in our demo, we'll create a **manifest file** listing just the documents we want.\n",
|
713 | 713 | "\n",
|
714 |
| - "> ⚠️ **Note:** 'Non-augmented' manifest files are still JSON-based, but a different format from the other dataset manifests we'll be using through this sample. You can find guidance for manifests as used here on the [S3DataSource API doc](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_S3DataSource.html), and separate information on the \"augmented\" manifests as used later with SageMaker Ground Truth in the [Ground Truth documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-input-data-input-manifest.html)." |
| 714 | + "> ⚠️ **Note:** 'Non-augmented' manifest files are still JSON-based, but a different format from the other dataset manifests we'll be using through this sample. You can find guidance on the [S3DataSource API doc](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_S3DataSource.html) for manifests as used here, and separate information in the [Ground Truth documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-input-data-input-manifest.html) on the \"augmented\" manifests as used later with SageMaker Ground Truth." |
715 | 715 | ]
|
716 | 716 | },
|
717 | 717 | {
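The note above points at the non-augmented manifest format documented for `S3DataSource`: a JSON array whose first element is a `prefix` object and whose remaining elements are object keys relative to that prefix. As a minimal sketch of building and uploading one with `boto3` (the bucket name, document keys, and manifest path here are illustrative, not taken from the notebook):

```python
import json

import boto3

# Illustrative names only - substitute your own bucket, keys, and manifest path:
bucket = "example-bucket"
manifest_key = "data/manifests/demo.manifest.json"

# Non-augmented manifest format (see the S3DataSource API doc): the first
# element sets a common S3 prefix; every following element is an object key
# relative to that prefix.
manifest = [
    {"prefix": f"s3://{bucket}/data/raw/"},
    "doc-0001.pdf",
    "doc-0002.pdf",
]

boto3.resource("s3").Object(bucket, manifest_key).put(
    Body=json.dumps(manifest).encode("utf-8")
)
```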
|
|
768 | 768 | "cell_type": "markdown",
|
769 | 769 | "metadata": {},
|
770 | 770 | "source": [
|
771 |
| - "The script we'll be using to process the documents is in the same folder as the Dockerfile used earlier to build the container image: [preproc/imgclean.py](preproc/imageclean.py).\n", |
| 771 | + "The script we'll be using to process the documents is in the same folder as the Dockerfile used earlier to build the container image: [preproc/imgclean.py](preproc/imgclean.py).\n", |
772 | 772 | "\n",
|
773 |
| - "The code parallelizes processing across available CPUs, and the `ShardedByS3Key` setting used on our [ProcessingInput](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.ProcessingInput) above distributes documents between instances if muliple are provided - so you should be able to `instance_type` and `instance_count` of the job if needed to take advantage of what resources you have available. The process is typically CPU-bound, so the `ml.c*` families are likely a good fit.\n", |
| 773 | + "The code parallelizes processing across available CPUs, and the `ShardedByS3Key` setting used on our [ProcessingInput](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.ProcessingInput) above distributes documents between instances if multiple are provided. This means you should be able to adjust both `instance_type` and `instance_count` below, and still take advantage of the resources configured. The process is typically CPU-bound, so the `ml.c*` families are likely a good fit.\n", |
774 | 774 | "\n",
|
775 | 775 | "The cell below will **run the processing job** and show logs from the job as it progresses. You can also check up on the status and history of jobs in the [Processing page of the Amazon SageMaker Console](https://console.aws.amazon.com/sagemaker/home?#/processing-jobs).\n",
|
776 | 776 | "\n",
|
|