
Commit d1d82e0

doc(nbs): minor pre-training clarifications
1 parent 93050a5 commit d1d82e0

File tree

1 file changed (+4 −5 lines)


notebooks/2. Model Training.ipynb

Lines changed: 4 additions & 5 deletions
@@ -315,9 +315,9 @@
 "\n",
 "In many cases, businesses have a great deal more relevant *unlabelled* data available in addition to the manually labeled dataset. For example, you might have many more historical documents available (with OCR results already, or able to be processed with Amazon Textract) than you're reasonably able to annotate entities on - just as we do in this example!\n",
 "\n",
-"Large-scale language models like the LayoutLM architecture we use here are typically **pre-trained** to unlabelled data in a **self-supervised** pattern: Teaching the model to predict some implicit task in the data like, for example, masking a few words on the page and predicting what words should go in the gaps.\n",
+"Large-scale language models like LayoutLM are typically **pre-trained** to unlabelled data in a **self-supervised** pattern: Teaching the model to predict some implicit task in the data like, for example, masking a few words on the page and predicting what words should go in the gaps.\n",
 "\n",
-"This pre-training doesn't directly teach the model to perform the target task (i.e. classifying entities), but forces the core of the model to learn intrinsic patterns in the data. When we then replace the output layers and **fine-tune** towards the target task with human-labelled data, the model is able to learn the target task more effectively.\n",
+"This pre-training doesn't directly teach the model to perform the target task (classifying entities), but forces the core of the model to learn intrinsic patterns in the data. When we then replace the output layers and **fine-tune** towards the target task with human-labelled data, the model is able to learn the target task more effectively.\n",
 "\n",
 "**In this example, pre-training is optional**:\n",
 "\n",
@@ -332,7 +332,7 @@
 ">\n",
 "> - Pre-training on only the 120 \"sample\" documents to 25 epochs took about 30 minutes on an `ml.p3.8xlarge` instance with per-device batch size 4\n",
 "> - Pre-training on a larger 500-document subset with the same infrastructure and settings took about an hour\n",
-"> - Although the observed effect on downstream (entity recognition) accuracy was generally positive in either case, it was not significant compared to variation over random seed initializations in fine-tuning. In this credit cards example use case, ambiguity and variability of the defined field types are likely to be more relevant limiting factors - rather domain-specific language variation versus the public pre-trained model."
+"> - Although the observed effect on downstream (entity recognition) accuracy metrics was generally positive in either case, it was small compared to variation over random seed initializations in fine-tuning."
 ]
 },
 {
@@ -411,7 +411,7 @@
 "\n",
 "Since customized inputs for this job might be more variable than fine-tuning (because annotating data requires effort, but scaling up your unlabelled corpus may be easy), it's worth mentioning some relevant parameter options:\n",
 "\n",
-"- **`instance_type`**: While `ml.g4dn.xlarge` is a nice, low-cost, GPU-enabled option for our small data fine-tuning job later; the larger data volume in pre-training makes the speed-up available from `ml.p3.2xlarge` more significant. The provided script is multi-GPU capable, so for bigger jobs you may find `ml.p3.8xlarge` and beyond give more acceptable run-times.\n",
+"- **`instance_type`**: While `ml.g4dn.xlarge` is a nice, low-hourly-cost, GPU-enabled option for our small data fine-tuning job later; the larger data volume in pre-training makes the speed-up available from `ml.p3.2xlarge` more significant. The provided script is multi-GPU capable, so for bigger jobs you may find `ml.p3.8xlarge` and beyond give more acceptable run-times.\n",
 "- **`per_device_train_batch_size`**: Controls *per-accelerator* batching; so bear in mind that moving up to a multi-GPU instance type (such as 4 GPUs in an `ml.p3.8xlarge`) implicitly increases the overall batch size for learning.\n",
 "- Other hyperparameters are available, as the implementation is generally based on the [Hugging Face TrainingArguments parser](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments) with [customizations applied in src/code/config.py](src/code/config.py)"
 ]
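To make the options discussed in this cell concrete, here is a hedged sketch of where `instance_type` and `per_device_train_batch_size` would typically be supplied to a SageMaker Hugging Face estimator. It is not the notebook's actual cell: the entry point, source directory, framework versions and S3 URI are placeholder assumptions.

```python
# Hedged sketch only: script name, source_dir, framework versions and data URI
# are assumptions, not the repository's actual values.
import sagemaker
from sagemaker.huggingface import HuggingFace

estimator = HuggingFace(
    entry_point="train.py",                # hypothetical pre-training script
    source_dir="src",                      # hypothetical source folder
    role=sagemaker.get_execution_role(),   # assumes running inside SageMaker
    instance_type="ml.p3.2xlarge",         # g4dn.xlarge is cheaper per hour; p3 trains faster
    instance_count=1,
    transformers_version="4.6.1",          # illustrative framework versions
    pytorch_version="1.7.1",
    py_version="py36",
    hyperparameters={
        "epochs": 25,
        # Per-GPU batch size: on a 4-GPU ml.p3.8xlarge this gives an effective batch of 16
        "per_device_train_batch_size": 4,
    },
)
# estimator.fit({"train": "s3://your-bucket/path/to/unlabelled-data"})  # placeholder S3 URI
```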
@@ -465,7 +465,6 @@
 " base_job_name=\"layoutlm-cfpb-pretrain\",\n",
 " output_path=f\"s3://{bucket_name}/{bucket_prefix}trainjobs\",\n",
 "\n",
-" # For big datasets and long runs, p3.2xl may be much faster than g4dn.xl\n",
 " instance_type=\"ml.p3.8xlarge\",\n",
 " instance_count=1,\n",
 " volume_size=50,\n",
