Commit b419a3c

Author: Bob Strahan
Merge branch 'develop' v0.3.16
2 parents: 71c9013 + 23c10ad

File tree: 109 files changed (+16677 additions, -2320 deletions)


.gitattributes

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+* text=auto eol=lf
+*.py text eol=lf
+*.sh text eol=lf
+*.yaml text eol=lf
+*.yml text eol=lf

.gitlab-ci.yml

Lines changed: 5 additions & 4 deletions
@@ -56,14 +56,15 @@ integration_tests:
   #   AWS_DEFAULT_REGION: ${AWS_DEFAULT_REGION}
   #   IDP_ACCOUNT_ID: ${IDP_ACCOUNT_ID}
 
-  # Add rules to only run on develop branch
+  # Add rules to only run on develop branch
+  # Add rules to only run on develop branch
   rules:
     - if: $CI_COMMIT_BRANCH == "develop"
-      when: manual # always # When idp-accelerator CICD is reconfigured
+      when: always # always # When idp-accelerator CICD is reconfigured
     - if: $CI_COMMIT_BRANCH =~ /^feature\/.*/
-      when: manual
+      when: always
     - if: $CI_COMMIT_BRANCH =~ /^fix\/.*/
-      when: manual
+      when: always
    - if: $CI_COMMIT_BRANCH =~ /^hotfix\/.*/
      when: manual
    - if: $CI_COMMIT_BRANCH =~ /^release\/.*/

CHANGELOG.md

Lines changed: 88 additions & 0 deletions
@@ -5,6 +5,94 @@ SPDX-License-Identifier: MIT-0
 
 ## [Unreleased]
 
+### Added
+
+## [0.3.16]
+
+### Added
+
+- **S3 Vectors Support for Cost-Optimized Knowledge Base Storage**
+  - Added S3 Vectors as an alternative vector store option to OpenSearch Serverless for the Bedrock Knowledge Base, with lower storage costs
+  - Custom resource Lambda implementation for S3 vector bucket and index management (using the boto3 s3vectors client) with proper IAM permissions and resource cleanup
+  - Unified Knowledge Base interface supporting both vector store types, with automatic resource provisioning based on user selection
+
+- **Page Limit Configuration for Classification Control**
+  - Added `maxPagesForClassification` configuration option to control how many pages are used during document classification
+  - **Default Behavior**: `"ALL"` - uses all pages for classification (existing behavior)
+  - **Limited Page Classification**: Set to a numeric value (e.g., `"1"`, `"2"`, `"3"`) to classify only the first N pages
+  - **Important**: When using a numeric limit, the classification result from the first N pages is applied to ALL pages in the document, effectively forcing the entire document to be assigned a single class with one section
+  - **Use Cases**: Performance optimization for large documents, cost reduction for documents with consistent classification patterns, simplified processing for homogeneous document types
+
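As a minimal illustrative sketch, the page-limit option described above might appear in a configuration file like this; the nesting under a `classification` key is an assumption, only the `maxPagesForClassification` name and its values come from the release notes:

```yaml
classification:
  # "ALL" (default) uses every page for classification
  maxPagesForClassification: "ALL"
  # Alternatively, a numeric string classifies from the first N pages
  # and applies that single class to the whole document, e.g.:
  # maxPagesForClassification: "2"
```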
+- **CloudFormation Service Role for Delegated Deployment Access**
+  - Added an example CloudFormation service role template that enables non-administrator users to deploy and maintain IDP stacks without requiring ongoing administrator permissions
+  - Administrators can provision the service role once with elevated privileges, then delegate deployment capabilities to developer/DevOps teams
+  - Includes comprehensive documentation and cross-referenced deployment guides explaining the security model and setup process
+
+### Fixed
+
+- Fixed issue where CloudFront policy statements still appeared in generated GovCloud templates despite CloudFront resources being removed
+- Fixed duplicate Glue tables being created when using a document class that contains a dash (-); resolved by replacing the dash in section types with an underscore when creating the table, to align with the table name generated later by the Glue crawler - resolves #57
+- Fixed occasional UI error 'Failed to get document details - please try again later' - resolves #58
+- Fixed UI zipfile creation to exclude .aws-sam directories and .env files from the deployment package
+- Added security recommendation to set the LogLevel parameter to WARN or ERROR (not INFO) for production deployments, to prevent logging of sensitive information including PII data, document contents, and S3 presigned URLs
+- Hardened several aspects of the new Discovery feature
+
+## [0.3.15]
+
+### Added
+
+- **Intelligent Document Discovery Module for Automated Configuration Generation**
+  - Added a Discovery module that automatically analyzes document samples to identify structure, field types, and organizational patterns
+  - **Pattern-Neutral Design**: Works across all processing patterns (1, 2, 3) with a unified discovery process and pattern-specific implementations
+  - **Dual Discovery Methods**: Discovery without ground truth (exploratory analysis) and with ground truth (optimization using labeled data)
+  - **Automated Blueprint Creation**: Pattern 1 includes zero-touch BDA blueprint generation with intelligent change detection and version management
+  - **Web UI Integration**: Real-time discovery job monitoring, interactive results review, and seamless configuration integration
+  - **Advanced Features**: Multi-model support (Nova, Claude), customizable prompts, configurable parameters, ground truth processing, schema conversion, and lifecycle management
+  - **Key Benefits**: Rapid onboarding of new document types, reduced time-to-production, configuration optimization, and automated workflow bootstrapping
+  - **Use Cases**: New document exploration, configuration improvement, rapid prototyping, and document understanding
+  - **Documentation**: Guide in `docs/discovery.md` with architecture details, best practices, and troubleshooting
+
+- **Optional Pattern-2 Regex-Based Classification for Enhanced Performance**
+  - Added support for optional regex patterns in document class definitions for performance optimization
+  - **Document Name Regex**: Matched against the document ID/name to classify all pages without LLM processing when all pages should share the same class
+  - **Document Page Content Regex**: Matched against page text content during multi-modal page-level classification for fast page classification
+  - **Key Benefits**: Significant performance improvements and cost savings by bypassing LLM calls for pattern-matched documents, deterministic classification results for known document patterns, and seamless fallback to existing LLM classification when regex patterns don't match
+  - **Configuration**: Optional `document_name_regex` and `document_page_content_regex` fields in class definitions, with automatic regex compilation and validation
+  - **Logging**: Comprehensive info-level logging when regex patterns match, for observability and debugging
+  - **CloudFormation Integration**: Updated the Pattern-2 schema to support regex configuration through the Web UI
+  - **Demonstration**: New `step2_classification_with_regex.ipynb` notebook showcasing regex configuration and performance comparisons
+  - **Documentation**: Enhanced the classification module README and main documentation with regex usage examples and best practices
+
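A hedged sketch of how the optional regex fields might sit in a class definition; the surrounding `classes` structure and the pattern values are illustrative assumptions, only the `document_name_regex` and `document_page_content_regex` field names come from the release notes:

```yaml
classes:
  - name: Paystub
    description: Employee pay statement
    # Hypothetical pattern: if the document ID/name matches, all pages
    # get this class without any LLM call
    document_name_regex: '(?i)paystub'
    # Hypothetical pattern: checked against page text during page-level
    # classification; non-matching pages fall back to LLM classification
    document_page_content_regex: '(?i)(gross\s+pay|net\s+pay)'
```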
+- **Windows WSL Development Environment Setup Guide**
+  - Added a WSL-based development environment setup guide for Windows developers in `docs/setup-development-env-WSL.md`
+  - **Key Features**: Automated setup script (`wsl_setup.sh`) for quick installation of Git, Python, Node.js, AWS CLI, and SAM CLI
+  - **Integrated Workflow**: Development setup combining Windows tools (VS Code, browsers) with a native Linux environment
+  - **Target Use Cases**: Windows developers needing Linux compatibility without Docker Desktop or VM overhead
+
+### Fixed
+
+- **Throttling Error Detection and Retry Logic for Assessment Functions** - [GitHub Issue #45](https://github.com/aws-solutions-library-samples/accelerated-intelligent-document-processing-on-aws/issues/45)
+  - **Assessment Function**: Enhanced throttling detection to check for throttling errors returned in the `document.errors` field in addition to thrown exceptions, raising `ThrottlingException` to trigger a Step Functions retry when throttling is detected
+  - **Granular Assessment Task Caching**: Fixed caching logic to properly cache successful assessment tasks when there are ANY failed tasks (both exception-based and result-based failures), enabling efficient retry optimization by reprocessing only the failed tasks while preserving successful results
+  - **Impact**: Improved resilience in throttling scenarios, reduced redundant processing during retries, and better Step Functions retry behavior
+
+- **Security Vulnerability Mitigation - Package Updates**
+
+- **GovCloud Compatibility - Hardcoded Service Domain References**
+  - Fixed hardcoded `amazonaws.com` references in CloudFormation templates that prevented GovCloud deployment
+  - Updated all service principals and endpoints to use dynamic `${AWS::URLSuffix}` expressions for automatic region-based resolution
+  - **Templates Updated**: `template.yaml` (main template), `patterns/pattern-3/sagemaker_classifier_endpoint.yaml`
+  - **Services Fixed**: EventBridge, Cognito, SageMaker, ECR, CloudFront, CodeBuild, AppSync, Lambda, DynamoDB, CloudWatch Logs, Glue
+  - Resolves GitHub Issue #50 - templates now deploy correctly in both standard AWS and GovCloud regions
+
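The `${AWS::URLSuffix}` substitution described above looks roughly like this in a template; the `lambda.${AWS::URLSuffix}` form mirrors the example in this commit's Makefile check, while the surrounding role resource is an illustrative sketch:

```yaml
LambdaExecutionRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Version: '2012-10-17'
      Statement:
        - Effect: Allow
          Principal:
            # Resolves to lambda.amazonaws.com in standard regions and to
            # the partition-specific suffix in GovCloud
            Service: !Sub 'lambda.${AWS::URLSuffix}'
          Action: sts:AssumeRole
```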
+- **Bug Fixes and Code Improvements**
+  - Fixed HITL processing errors in both Pattern-1 (DynamoDB validation with empty strings) and Pattern-2 (string indices error in A2I output processing)
+  - Fixed Step Function UI issues, including the auto-refresh button auto-disabling and fetch failures for failed executions with datetime serialization errors
+  - Cleaned up unused Step Function subscription infrastructure and removed duplicate code in the Pattern-2 HITL function
+  - Expanded the UI Visual Editor bounding box with padding for better visibility and user interaction
+  - Fixed a bug in the list of models supporting cache points; previously Claude 4 Sonnet and Opus had been excluded
+  - Added validations at the assessment step to check for a valid JSON response; the validation fails after extraction/assessment is complete if JSON parsing issues are encountered
+
 ## [0.3.14]
 
 ### Added

Makefile

Lines changed: 15 additions & 7 deletions
@@ -40,26 +40,34 @@ lint-cicd:
 	fi
 	@echo -e "$(GREEN)All code quality checks passed!$(NC)"
 
-# Check CloudFormation templates for hardcoded AWS partition ARNs
+# Check CloudFormation templates for hardcoded AWS partition ARNs and service principals
 check-arn-partitions:
-	@echo "Checking CloudFormation templates for hardcoded ARN partitions..."
+	@echo "Checking CloudFormation templates for hardcoded ARN partitions and service principals..."
 	@FOUND_ISSUES=0; \
 	for template in template.yaml patterns/*/template.yaml patterns/*/sagemaker_classifier_endpoint.yaml options/*/template.yaml; do \
 		if [ -f "$$template" ]; then \
 			echo "Checking $$template..."; \
-			MATCHES=$$(grep -n "arn:aws:" "$$template" | grep -v "arn:\$${AWS::Partition}:" || true); \
-			if [ -n "$$MATCHES" ]; then \
+			ARN_MATCHES=$$(grep -n "arn:aws:" "$$template" | grep -v "arn:\$${AWS::Partition}:" || true); \
+			if [ -n "$$ARN_MATCHES" ]; then \
 				echo -e "$(RED)ERROR: Found hardcoded 'arn:aws:' references in $$template:$(NC)"; \
-				echo "$$MATCHES" | sed 's/^/ /'; \
+				echo "$$ARN_MATCHES" | sed 's/^/ /'; \
 				echo -e "$(YELLOW) These should use 'arn:\$${AWS::Partition}:' instead for GovCloud compatibility$(NC)"; \
 				FOUND_ISSUES=1; \
 			fi; \
+			SERVICE_MATCHES=$$(grep -n "\.amazonaws\.com" "$$template" | grep -v "\$${AWS::URLSuffix}" | grep -v "^[[:space:]]*#" | grep -v "Description:" | grep -v "Comment:" | grep -v "cognito" | grep -v "ContentSecurityPolicy" || true); \
+			if [ -n "$$SERVICE_MATCHES" ]; then \
+				echo -e "$(RED)ERROR: Found hardcoded service principal references in $$template:$(NC)"; \
+				echo "$$SERVICE_MATCHES" | sed 's/^/ /'; \
+				echo -e "$(YELLOW) These should use '\$${AWS::URLSuffix}' instead of 'amazonaws.com' for GovCloud compatibility$(NC)"; \
+				echo -e "$(YELLOW) Example: 'lambda.amazonaws.com' should be 'lambda.\$${AWS::URLSuffix}'$(NC)"; \
+				FOUND_ISSUES=1; \
+			fi; \
 		fi; \
 	done; \
 	if [ $$FOUND_ISSUES -eq 0 ]; then \
-		echo -e "$(GREEN)✅ No hardcoded ARN partition references found!$(NC)"; \
+		echo -e "$(GREEN)✅ No hardcoded ARN partition or service principal references found!$(NC)"; \
 	else \
-		echo -e "$(RED)❌ Found hardcoded ARN partition references that need to be fixed$(NC)"; \
+		echo -e "$(RED)❌ Found hardcoded references that need to be fixed for GovCloud compatibility$(NC)"; \
 		exit 1; \
 	fi
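For illustration, here is the kind of template line the extended check flags versus accepts; the bucket name is hypothetical, and the compliant forms follow the `arn:${AWS::Partition}:` and `${AWS::URLSuffix}` guidance printed by the target above:

```yaml
# Flagged by check-arn-partitions: hardcoded partition and service domain
#   Resource: arn:aws:s3:::example-bucket/*
#   Service: lambda.amazonaws.com
# Accepted: partition- and suffix-aware forms
Resource: !Sub 'arn:${AWS::Partition}:s3:::example-bucket/*'
Service: !Sub 'lambda.${AWS::URLSuffix}'
```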

README.md

Lines changed: 1 addition & 0 deletions
@@ -128,6 +128,7 @@ For detailed deployment and testing instructions, see the [Deployment Guide](./d
 - [Agent Analysis](./docs/agent-analysis.md) - Natural language analytics and data visualization feature
 - [Custom MCP Agent](./docs/custom-MCP-agent.md) - Integrating external MCP servers for custom tools and capabilities
 - [Configuration](./docs/configuration.md) - Configuration and customization options
+- [Discovery](./docs/discovery.md) - Pattern-neutral discovery process and BDA blueprint automation
 - [Classification](./docs/classification.md) - Customizing document classification
 - [Extraction](./docs/extraction.md) - Customizing information extraction
 - [Human-in-the-Loop Review](./docs/human-review.md) - Human review workflows with Amazon A2I

VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-0.3.14
+0.3.16

config_library/pattern-1/lending-package-sample/config.yaml

Lines changed: 102 additions & 0 deletions
@@ -215,3 +215,105 @@ pricing:
         price: '1.5E-6'
       - name: cacheWriteInputTokens
         price: '1.875E-5'
+discovery:
+  output_format:
+    sample_json: |-
+      {
+        "document_class" : "Form-1040",
+        "document_description" : "Brief summary of the document",
+        "groups" : [
+          {
+            "name" : "PersonalInformation",
+            "description" : "Personal information of Tax payer",
+            "attributeType" : "group",
+            "groupAttributes" : [
+              {
+                "name": "FirstName",
+                "dataType" : "string",
+                "description" : "First Name of Taxpayer"
+              },
+              {
+                "name": "Age",
+                "dataType" : "number",
+                "description" : "Age of Taxpayer"
+              }
+            ]
+          },
+          {
+            "name" : "Dependents",
+            "description" : "Dependents of taxpayer",
+            "attributeType" : "list",
+            "listItemTemplate": {
+              "itemAttributes" : [
+                {
+                  "name": "FirstName",
+                  "dataType" : "string",
+                  "description" : "Dependent first name"
+                },
+                {
+                  "name": "Age",
+                  "dataType" : "number",
+                  "description" : "Dependent Age"
+                }
+              ]
+            }
+          }
+        ]
+      }
+  with_ground_truth:
+    top_p: '0.1'
+    temperature: '1.0'
+    user_prompt: >-
+      This image contains unstructured data. Analyze the data line by line using the provided ground truth as reference.
+      <GROUND_TRUTH_REFERENCE>
+      {ground_truth_json}
+      </GROUND_TRUTH_REFERENCE>
+      Ground truth reference JSON has the fields we are interested in extracting from the document/image. Use the ground truth to optimize field extraction. Match field names, data types, and groupings from the reference.
+      Image may contain multiple pages, process all pages.
+      Extract all field names including those without values.
+      Do not change the group name and field name from ground truth in the extracted data json.
+      Add field_description field for every field which will contain instruction to LLM to extract the field data from the image/document. Add data_type field for every field.
+      Add two fields document_class and document_description.
+      For document_class generate a short name based on the document content like W4, I-9, Paystub.
+      For document_description generate a description about the document in less than 50 words.
+      If the group repeats and follows table format, update the attributeType as "list".
+      Do not extract the values.
+      Format the extracted data using the below JSON format:
+      Format the extracted groups and fields using the below JSON format:
+
+    model_id: us.amazon.nova-pro-v1:0
+    system_prompt: >-
+      You are an expert in processing forms. Extracting data from images and
+      documents. Use provided ground truth data as reference to optimize field
+      extraction and ensure consistency with expected document structure and
+      field definitions.
+    max_tokens: '10000'
+  without_ground_truth:
+    top_p: '0.1'
+    temperature: '1.0'
+    user_prompt: >-
+      This image contains forms data. Analyze the form line by line.
+      Image may contains multiple pages, process all the pages.
+      Form may contain multiple name value pair in one line.
+      Extract all the names in the form including the name value pair which doesn't have value.
+      Organize them into groups, extract field_name, data_type and field description
+      Field_name should be less than 60 characters, should not have space use '-' instead of space.
+      field_description is a brief description of the field and the location of the field like box number or line number in the form and section of the form.
+      Field_name should be unique within the group.
+      Add two fields document_class and document_description.
+      For document_class generate a short name based on the document content like W4, I-9, Paystub.
+      For document_description generate a description about the document in less than 50 words.
+
+      Group the fields based on the section they are grouped in the form. Group should have attributeType as "group".
+      If the group repeats and follows table format, update the attributeType as "list".
+      Do not extract the values.
+      Return the extracted data in JSON format.
+      Format the extracted data using the below JSON format:
+      Format the extracted groups and fields using the below JSON format:
+    model_id: us.amazon.nova-pro-v1:0
+    system_prompt: >-
+      You are an expert in processing forms. Extracting data from images and
+      documents. Analyze forms line by line to identify field names, data types,
+      and organizational structure. Focus on creating comprehensive blueprints
+      for document processing without extracting actual values.
+    max_tokens: '10000'
