# Applying and Customizing the Amazon Textract Transformer Pipeline

This file contains suggestions and considerations to help you apply and customize the sample to your own use cases.

> ⚠️ **Remember:** This repository is an illustrative sample, not intended as fully production-ready code. The guidance here is **not** an exhaustive path-to-production checklist.

## Bring your own dataset: Getting started step-by-step

So you've cloned this repository and reviewed the "Getting started" installation steps in [the README](README.md) - how can you get started with your own dataset instead of the credit card agreements example?

### Step 1: Any up-front CDK customizations

Depending on your use case you might want to make some customizations to the pipeline infrastructure itself. You can always revisit this later by running `cdk deploy` again to update your stack - but if you know up-front that some adjustments will be needed, you might choose to make them first.

Particular examples might include:

- Tuning to support **large documents**, especially if you'll be processing documents of more than ~100-150 pages
- Enabling **additional online Textract features** (e.g. `TABLES` and/or `FORMS`) if you'll need them in online processing

For details on these and other use cases, see the **Customizing the pipeline** section below.

### Step 2: Deploy the stack and set up SageMaker

Follow the "Getting started" steps as outlined in [the README](README.md) to deploy your pipeline and set up your SageMaker notebook environment with the sample code and notebooks - but don't start running through the notebooks just yet.

### Step 3: Clear out the sample annotations

In SageMaker, delete the provided `notebooks/data/annotations/augmentation-*` folders of pre-baked annotations on the credit card documents.

> **Why?** The logic in notebook 1 for selecting a sample of documents to Textract and annotate automatically looks at your existing `data/annotations` to choose target files - so you'll see missing document errors if you don't delete these annotation files first.

### Step 4: Load your documents to SageMaker

Start running through [notebook 1](notebooks/1.%20Data%20Preparation.ipynb), but follow the guidance in the *"Fetch the data"* section to load your raw documents into the `notebooks/data/raw` folder in SageMaker **instead** of the sample CFPB documents.

How you load your documents into SageMaker may differ depending on where they're stored today. For example:

- If they're currently on your local computer, you should be able to drag and drop them onto the folder pane in SageMaker/JupyterLab to upload.
- If they're currently on Amazon S3, you can copy a folder by running e.g. `!aws s3 sync s3://{DOC-EXAMPLE-BUCKET}/my/folder data/raw` from a cell in notebook 1.
- If they're currently compressed in a zip file, you can refer to the code used on the example data to help extract and tidy up the files.

> **Note:** Your customized notebook should still set the `rel_filepaths` variable, which is used by later steps.

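
For instance, a minimal sketch of the zip-file case from a notebook cell. The archive name is hypothetical, and the exact format expected for `rel_filepaths` (here assumed to be paths relative to the raw data folder) should be checked against how the sample notebook builds the variable:

```python
import zipfile
from pathlib import Path

raw_folder = Path("data/raw")  # Folder that notebook 1 works from
raw_folder.mkdir(parents=True, exist_ok=True)

# Hypothetical archive name - e.g. a zip you dragged into JupyterLab:
archive = Path("my-documents.zip")
if archive.exists():
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(raw_folder)

# Later steps use `rel_filepaths`; assumed here to be sorted string paths
# relative to the raw folder - check the sample notebook for the exact format.
rel_filepaths = sorted(
    str(p.relative_to(raw_folder)) for p in raw_folder.rglob("*.pdf")
)
print(f"Found {len(rel_filepaths)} documents")
```
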
### Step 5: Customize the field configurations

In the *Defining the challenge* section of notebook 1, customize the `fields` list of `FieldConfiguration` definitions for the entities/fields you'd like to extract.

Each `FieldConfiguration` defines not just the name of the field, but also other attributes like:

- The `annotation_guidance` that should be shown in the labelling tool to help workers annotate consistently
- How a single representative value should be `select`ed from multiple detected entity mentions, or else (implicit) that multiple values should be preserved
- Whether the field is mandatory on a document (implicit), or else `optional` or even completely `ignore`d
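
To sketch the shape of a customized `fields` list, the stand-in dataclass below mirrors the attributes described above. It is **not** the sample's actual class - the real `FieldConfiguration` (and its exact parameter names and `select` options) lives in the sample's notebook utility code, and the field names here are invented:

```python
from dataclasses import dataclass
from typing import Optional

# Minimal stand-in so this sketch is self-contained - the real class in the
# sample's utility code may use different parameter names and types:
@dataclass
class FieldConfiguration:
    name: str
    annotation_guidance: str = ""
    select: Optional[str] = None  # How to pick one representative mention
    optional: bool = False  # Mandatory on each document unless True
    ignore: bool = False  # Exclude the field entirely

fields = [
    FieldConfiguration(
        "Agreement Date",  # Hypothetical field
        annotation_guidance="Highlight the full date the agreement was signed.",
        select="first",  # Keep a single representative value
    ),
    FieldConfiguration(
        "Account Number",  # Hypothetical field
        annotation_guidance="Highlight every account number mention.",
        # No `select`: multiple detected values are preserved
        optional=True,  # Not every document includes one
    ),
]
```
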

Your model will perform best if:

- It's possible to highlight your configured entities with complete consistency (no differences of opinion between annotators or tricky edge cases)
- Matches of your entities can be confirmed with local context somewhat nearby on the page (e.g. not dictated by a sentence right at the other end of the page, or on a previous page)

Configuring a large number of `fields` may increase the required memory footprint of the model and downstream components in the pipeline, and could impact the accuracy of trained models. The SageMaker Ground Truth bounding box annotation tool used in the sample supports up to 50 labels as documented [in the SageMaker Developer Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-bounding-box.html) - so attempting to configure more than 50 `fields` would require additional workarounds.


### Step 6: Enabling batch Textract features (if you want them)

In the *Textract the input documents* section of notebook 1, you'll see `features=[]` by default to optimize costs - since the SageMaker model and sample post-processing Lambda do not use or depend on additional Amazon Textract features like `TABLES` and `FORMS`.

If you'd like to enable these extra analyses for your batch processing in the notebook, set e.g. `features=["FORMS", "TABLES"]`. This setting is for the batch analysis only and does not affect the online behaviour of the deployed pipeline.

### Step 7: Customize how pages are selected for annotation

In the credit card agreements example, there's lots of data and no strong requirements on what to annotate. The code in the *Collect a dataset* section of notebook 1 selects pages from the corpus at random, but with a fixed (higher) proportion of first-page samples, because the example entities are most likely to occur at the start of the document.

For your own use cases this emphasis on first pages may not apply. Moreover, if you're strongly data- or time-constrained, you might prefer to pick out a specific list of most-relevant pages to annotate!

Consider editing the `select_examples()` function to customize how the set of candidate document pages is chosen for the next annotation job, excluding the already-labelled `exclude_img_paths`.
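
As one illustration, a replacement sampler along the lines of the notebook's first-page weighting. The signature and the page-image naming convention (`mydoc-0001.png`) are assumptions here - check the actual `select_examples()` in notebook 1 before adapting it:

```python
import random

def select_examples(candidate_img_paths, exclude_img_paths, n_samples=100,
                    first_page_frac=0.5):
    """Illustrative page sampler: oversample first pages, fill with random others.

    Assumes page images are named like `mydoc-0001.png`, so first pages end in
    `-0001` - adjust the check to your own naming convention.
    """
    exclude = set(exclude_img_paths)  # Skip already-labelled pages
    candidates = [p for p in candidate_img_paths if p not in exclude]
    firsts = [p for p in candidates if p.rsplit("-", 1)[-1].startswith("0001")]
    others = [p for p in candidates if p not in set(firsts)]
    # Take up to `first_page_frac` of the sample from first pages, rest random:
    n_first = min(len(firsts), int(n_samples * first_page_frac))
    n_other = min(len(others), n_samples - n_first)
    return random.sample(firsts, n_first) + random.sample(others, n_other)
```

If you instead have a hand-picked list of most-relevant pages, the function could simply return that list (minus `exclude_img_paths`) directly.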

### Step 8: Proceed with data annotation and subsequent steps

From the labelling job onwards (through notebook 2 and beyond), the flow should be essentially the same as with the sample data. Just remember to edit the `include_jobs` list in notebook 2 to reflect the actual annotation jobs you performed.

If your dataset is particularly tiny (more like e.g. 30 labelled pages than 100), it might be helpful to try increasing the `early_stopping_patience` hyperparameter to force the training job to re-process the same examples for longer. You could also explore hyperparameter tuning. However, it'd likely have a bigger impact to spend that time annotating more data instead!
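
For example, the override might look something like this when configuring the training job in notebook 2. Only `early_stopping_patience` is named by this guide - the other key and all values here are illustrative, so check notebook 2 for the real hyperparameter names and defaults:

```python
# Sketch: hyperparameter overrides for a small-data training run
# (illustrative values - see notebook 2 for the sample's actual settings):
hyperparameters = {
    "epochs": 100,
    "early_stopping_patience": 20,  # Wait longer before halting training
}
```
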


## Customizing the pipeline

### Handling large documents

Because some components of the pipeline have configured timeouts or process consolidated document Textract JSON in memory, scaling to very large documents (e.g. hundreds of pages) may require some configuration changes in the CDK solution.

Consider:

- Increasing the `timeout_excluding_queue` (in [pipeline/ocr/__init__.py TextractOcrStep](pipeline/ocr/__init__.py)) to accommodate the longer Textract processing and Lambda consolidation time (e.g. to 20mins+)
- Increasing the `timeout` and `memory_size` of the `CallTextract` Lambda function in [pipeline/ocr/__init__.py](pipeline/ocr/__init__.py) to accommodate consolidating the large Textract result JSON to a single S3 file (e.g. to 300sec, 1024MB)
- Likewise increasing the `memory_size` of the `PostProcessFn` Lambda function in [pipeline/postprocessing/__init__.py](pipeline/postprocessing/__init__.py), which also loads and processes the full document JSON (e.g. to 1024MB)
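
In CDK terms, these are keyword-argument tweaks on the existing constructs. The fragment below is illustrative only - it omits the constructs' other required properties, and the linked files contain the real code and current values:

```python
# Configuration fragment (not runnable standalone): example adjustments to the
# CallTextract Lambda in pipeline/ocr/__init__.py. Values are starting points.
from aws_cdk import Duration, aws_lambda as lambda_

call_textract_fn = lambda_.Function(
    self,  # The enclosing CDK construct scope
    "CallTextract",
    # ... other properties as in the existing source file ...
    timeout=Duration.seconds(300),   # Allow time to consolidate big result JSON
    memory_size=1024,                # MB; full document JSON is held in memory
)
```
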

### Using Amazon Textract `TABLES` and `FORMS` features in the pipeline

The sample SageMaker model and post-processing Lambda function neither depend on nor use the additional [tables](https://docs.aws.amazon.com/textract/latest/dg/how-it-works-tables.html) and [forms](https://docs.aws.amazon.com/textract/latest/dg/how-it-works-kvp.html) features of Amazon Textract, so by default they're disabled in the pipeline.

To enable these features for documents processed by the pipeline, you could for example:

- Add a key specifying `sfn_input["Textract"]["Features"] = ["FORMS", "TABLES"]` to the S3 trigger Lambda function in [pipeline/fn-trigger/main.py](pipeline/fn-trigger/main.py) to explicitly set this combination for all pipeline executions triggered by S3 uploads, OR
- Add a `DEFAULT_TEXTRACT_FEATURES=FORMS,TABLES` environment variable to the `CallTextract` Lambda function in [pipeline/ocr/__init__.py](pipeline/ocr/__init__.py) to make that the default setting whenever a pipeline run doesn't explicitly configure it.
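
For the first option, the change is a one-line dictionary update where the trigger Lambda builds the Step Functions execution input. Only the `sfn_input["Textract"]["Features"]` key comes from the guidance above - the surrounding input structure shown here is illustrative:

```python
# Illustrative stand-in for the execution input built in
# pipeline/fn-trigger/main.py (real structure is in that file):
sfn_input = {"Input": {"Bucket": "doc-example-bucket", "Key": "uploads/doc.pdf"}}

# Ensure the Textract section exists, then request the extra analyses for
# every S3-triggered pipeline run:
sfn_input.setdefault("Textract", {})["Features"] = ["FORMS", "TABLES"]
```
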

Once the features are enabled for your pipeline, you can edit the post-processing Lambda function (in [pipeline/postprocessing/fn-postprocess](pipeline/postprocessing/fn-postprocess)) to combine them with your SageMaker model results as needed.

For example, you could loop through the rows and cells of detected `TABLE`s in your document, using the SageMaker entity model results for each `WORD` to find the specific records and columns that you're interested in.
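
A sketch of that traversal, based on the standard block structure of Textract responses (`TABLE` → `CELL` → `WORD` via `CHILD` relationships). Joining each `WORD` back to your SageMaker entity results - e.g. by block `Id` - is left out, since that depends on your post-processing output format:

```python
def iter_table_cells(blocks):
    """Yield (row, column, text) for each cell of each TABLE in a Textract
    `Blocks` list, resolving CHILD relationships down to WORD blocks."""
    by_id = {b["Id"]: b for b in blocks}

    def children(block):
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                for child_id in rel["Ids"]:
                    yield by_id[child_id]

    for table in (b for b in blocks if b["BlockType"] == "TABLE"):
        for cell in children(table):
            if cell["BlockType"] != "CELL":
                continue
            words = [w["Text"] for w in children(cell) if w["BlockType"] == "WORD"]
            yield cell["RowIndex"], cell["ColumnIndex"], " ".join(words)
```
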

If you need to change the output format of the post-processing Lambda function, note that the A2I human review template will likely also need to be updated.