Commit b419a3c

Author: Bob Strahan
Merge branch 'develop' v0.3.16
2 parents: 71c9013 + 23c10ad

File tree: 109 files changed (+16677 additions, -2320 deletions)


.gitattributes

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+* text=auto eol=lf
+*.py text eol=lf
+*.sh text eol=lf
+*.yaml text eol=lf
+*.yml text eol=lf

.gitlab-ci.yml

Lines changed: 5 additions & 4 deletions
@@ -56,14 +56,15 @@ integration_tests:
   #   AWS_DEFAULT_REGION: ${AWS_DEFAULT_REGION}
   #   IDP_ACCOUNT_ID: ${IDP_ACCOUNT_ID}
 
-  # Add rules to only run on develop branch
+  # Add rules to only run on develop branch
+  # Add rules to only run on develop branch
   rules:
     - if: $CI_COMMIT_BRANCH == "develop"
-      when: manual # always # When idp-accelerator CICD is reconfigured
+      when: always # always # When idp-accelerator CICD is reconfigured
     - if: $CI_COMMIT_BRANCH =~ /^feature\/.*/
-      when: manual
+      when: always
     - if: $CI_COMMIT_BRANCH =~ /^fix\/.*/
-      when: manual
+      when: always
    - if: $CI_COMMIT_BRANCH =~ /^hotfix\/.*/
      when: manual
    - if: $CI_COMMIT_BRANCH =~ /^release\/.*/

CHANGELOG.md

Lines changed: 88 additions & 0 deletions
@@ -5,6 +5,94 @@ SPDX-License-Identifier: MIT-0
 
 ## [Unreleased]
 
+### Added
+
+## [0.3.16]
+
+### Added
+
+- **S3 Vectors Support for Cost-Optimized Knowledge Base Storage**
+  - Added S3 Vectors as an alternative vector store option to OpenSearch Serverless for the Bedrock Knowledge Base, with lower storage costs
+  - Custom resource Lambda implementation for S3 vector bucket and index management (using the boto3 s3vectors client) with proper IAM permissions and resource cleanup
+  - Unified Knowledge Base interface supporting both vector store types, with automatic resource provisioning based on user selection
+
+- **Page Limit Configuration for Classification Control**
+  - Added `maxPagesForClassification` configuration option to control how many pages are used during document classification
+  - **Default Behavior**: `"ALL"` - uses all pages for classification (existing behavior)
+  - **Limited Page Classification**: Set to a numeric value (e.g., `"1"`, `"2"`, `"3"`) to classify only the first N pages
+  - **Important**: When using a numeric limit, the classification result from the first N pages is applied to ALL pages in the document, effectively forcing the entire document to be assigned a single class with one section
+  - **Use Cases**: Performance optimization for large documents, cost reduction for documents with consistent classification patterns, simplified processing for homogeneous document types
+
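As a minimal illustrative sketch, the page-limit option described above might appear in a configuration file like this; the nesting under a `classification` key is an assumption, only the `maxPagesForClassification` name and its values come from the release notes:

```yaml
classification:
  # "ALL" (default) uses every page for classification
  maxPagesForClassification: "ALL"
  # Alternatively, a numeric string classifies from the first N pages
  # and applies that single class to the whole document, e.g.:
  # maxPagesForClassification: "2"
```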
+- **CloudFormation Service Role for Delegated Deployment Access**
+  - Added an example CloudFormation service role template that enables non-administrator users to deploy and maintain IDP stacks without requiring ongoing administrator permissions
+  - Administrators can provision the service role once with elevated privileges, then delegate deployment capabilities to developer/DevOps teams
+  - Includes comprehensive documentation and cross-referenced deployment guides explaining the security model and setup process
+
+### Fixed
+
+- Fixed issue where CloudFront policy statements still appeared in generated GovCloud templates despite CloudFront resources being removed
+- Fixed duplicate Glue tables being created when using a document class that contains a dash (-); resolved by replacing the dash in section types with an underscore when creating the table, to align with the table name generated later by the Glue crawler - resolves #57
+- Fixed occasional UI error 'Failed to get document details - please try again later' - resolves #58
+- Fixed UI zipfile creation to exclude .aws-sam directories and .env files from the deployment package
+- Added security recommendation to set the LogLevel parameter to WARN or ERROR (not INFO) for production deployments, to prevent logging of sensitive information including PII data, document contents, and S3 presigned URLs
+- Hardened several aspects of the new Discovery feature
+
+## [0.3.15]
+
+### Added
+
+- **Intelligent Document Discovery Module for Automated Configuration Generation**
+  - Added a Discovery module that automatically analyzes document samples to identify structure, field types, and organizational patterns
+  - **Pattern-Neutral Design**: Works across all processing patterns (1, 2, 3) with a unified discovery process and pattern-specific implementations
+  - **Dual Discovery Methods**: Discovery without ground truth (exploratory analysis) and with ground truth (optimization using labeled data)
+  - **Automated Blueprint Creation**: Pattern 1 includes zero-touch BDA blueprint generation with intelligent change detection and version management
+  - **Web UI Integration**: Real-time discovery job monitoring, interactive results review, and seamless configuration integration
+  - **Advanced Features**: Multi-model support (Nova, Claude), customizable prompts, configurable parameters, ground truth processing, schema conversion, and lifecycle management
+  - **Key Benefits**: Rapid onboarding of new document types, reduced time-to-production, configuration optimization, and automated workflow bootstrapping
+  - **Use Cases**: New document exploration, configuration improvement, rapid prototyping, and document understanding
+  - **Documentation**: Guide in `docs/discovery.md` with architecture details, best practices, and troubleshooting
+
+- **Optional Pattern-2 Regex-Based Classification for Enhanced Performance**
+  - Added support for optional regex patterns in document class definitions for performance optimization
+  - **Document Name Regex**: Matched against the document ID/name to classify all pages without LLM processing when all pages should share the same class
+  - **Document Page Content Regex**: Matched against page text content during multi-modal page-level classification for fast page classification
+  - **Key Benefits**: Significant performance improvements and cost savings by bypassing LLM calls for pattern-matched documents, deterministic classification results for known document patterns, and seamless fallback to existing LLM classification when regex patterns don't match
+  - **Configuration**: Optional `document_name_regex` and `document_page_content_regex` fields in class definitions, with automatic regex compilation and validation
+  - **Logging**: Comprehensive info-level logging when regex patterns match, for observability and debugging
+  - **CloudFormation Integration**: Updated the Pattern-2 schema to support regex configuration through the Web UI
+  - **Demonstration**: New `step2_classification_with_regex.ipynb` notebook showcasing regex configuration and performance comparisons
+  - **Documentation**: Enhanced the classification module README and main documentation with regex usage examples and best practices
+
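A hedged sketch of how the optional regex fields might sit in a class definition; the surrounding `classes` structure and the pattern values are illustrative assumptions, only the `document_name_regex` and `document_page_content_regex` field names come from the release notes:

```yaml
classes:
  - name: Paystub
    description: Employee pay statement
    # Hypothetical pattern: if the document ID/name matches, all pages
    # get this class without any LLM call
    document_name_regex: '(?i)paystub'
    # Hypothetical pattern: checked against page text during page-level
    # classification; non-matching pages fall back to LLM classification
    document_page_content_regex: '(?i)(gross\s+pay|net\s+pay)'
```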
+- **Windows WSL Development Environment Setup Guide**
+  - Added a WSL-based development environment setup guide for Windows developers in `docs/setup-development-env-WSL.md`
+  - **Key Features**: Automated setup script (`wsl_setup.sh`) for quick installation of Git, Python, Node.js, AWS CLI, and SAM CLI
+  - **Integrated Workflow**: Development setup combining Windows tools (VS Code, browsers) with a native Linux environment
+  - **Target Use Cases**: Windows developers needing Linux compatibility without Docker Desktop or VM overhead
+
+### Fixed
+
+- **Throttling Error Detection and Retry Logic for Assessment Functions** - [GitHub Issue #45](https://github.com/aws-solutions-library-samples/accelerated-intelligent-document-processing-on-aws/issues/45)
+  - **Assessment Function**: Enhanced throttling detection to check for throttling errors returned in the `document.errors` field in addition to thrown exceptions, raising `ThrottlingException` to trigger a Step Functions retry when throttling is detected
+  - **Granular Assessment Task Caching**: Fixed caching logic to properly cache successful assessment tasks when there are ANY failed tasks (both exception-based and result-based failures), enabling efficient retry optimization by reprocessing only the failed tasks while preserving successful results
+  - **Impact**: Improved resilience in throttling scenarios, reduced redundant processing during retries, and better Step Functions retry behavior
+
+- **Security Vulnerability Mitigation - Package Updates**
+
+- **GovCloud Compatibility - Hardcoded Service Domain References**
+  - Fixed hardcoded `amazonaws.com` references in CloudFormation templates that prevented GovCloud deployment
+  - Updated all service principals and endpoints to use dynamic `${AWS::URLSuffix}` expressions for automatic region-based resolution
+  - **Templates Updated**: `template.yaml` (main template), `patterns/pattern-3/sagemaker_classifier_endpoint.yaml`
+  - **Services Fixed**: EventBridge, Cognito, SageMaker, ECR, CloudFront, CodeBuild, AppSync, Lambda, DynamoDB, CloudWatch Logs, Glue
+  - Resolves GitHub Issue #50 - templates now deploy correctly in both standard AWS and GovCloud regions
+
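The `${AWS::URLSuffix}` substitution described above looks roughly like this in a template; the `lambda.${AWS::URLSuffix}` form mirrors the example in this commit's Makefile check, while the surrounding role resource is an illustrative sketch:

```yaml
LambdaExecutionRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Version: '2012-10-17'
      Statement:
        - Effect: Allow
          Principal:
            # Resolves to lambda.amazonaws.com in standard regions and to
            # the partition-specific suffix in GovCloud
            Service: !Sub 'lambda.${AWS::URLSuffix}'
          Action: sts:AssumeRole
```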
+- **Bug Fixes and Code Improvements**
+  - Fixed HITL processing errors in both Pattern-1 (DynamoDB validation with empty strings) and Pattern-2 (string indices error in A2I output processing)
+  - Fixed Step Function UI issues, including the auto-refresh button auto-disabling and fetch failures for failed executions with datetime serialization errors
+  - Cleaned up unused Step Function subscription infrastructure and removed duplicate code in the Pattern-2 HITL function
+  - Expanded the UI Visual Editor bounding box with padding for better visibility and user interaction
+  - Fixed a bug in the list of models supporting cache points; previously Claude 4 Sonnet and Opus had been excluded
+  - Added validations at the assessment step to check for a valid JSON response; the validation fails after extraction/assessment is complete if JSON parsing issues are encountered
+
 ## [0.3.14]
 
 ### Added

Makefile

Lines changed: 15 additions & 7 deletions
@@ -40,26 +40,34 @@ lint-cicd:
 	fi
 	@echo -e "$(GREEN)All code quality checks passed!$(NC)"
 
-# Check CloudFormation templates for hardcoded AWS partition ARNs
+# Check CloudFormation templates for hardcoded AWS partition ARNs and service principals
 check-arn-partitions:
-	@echo "Checking CloudFormation templates for hardcoded ARN partitions..."
+	@echo "Checking CloudFormation templates for hardcoded ARN partitions and service principals..."
 	@FOUND_ISSUES=0; \
 	for template in template.yaml patterns/*/template.yaml patterns/*/sagemaker_classifier_endpoint.yaml options/*/template.yaml; do \
 		if [ -f "$$template" ]; then \
 			echo "Checking $$template..."; \
-			MATCHES=$$(grep -n "arn:aws:" "$$template" | grep -v "arn:\$${AWS::Partition}:" || true); \
-			if [ -n "$$MATCHES" ]; then \
+			ARN_MATCHES=$$(grep -n "arn:aws:" "$$template" | grep -v "arn:\$${AWS::Partition}:" || true); \
+			if [ -n "$$ARN_MATCHES" ]; then \
 				echo -e "$(RED)ERROR: Found hardcoded 'arn:aws:' references in $$template:$(NC)"; \
-				echo "$$MATCHES" | sed 's/^/ /'; \
+				echo "$$ARN_MATCHES" | sed 's/^/ /'; \
 				echo -e "$(YELLOW) These should use 'arn:\$${AWS::Partition}:' instead for GovCloud compatibility$(NC)"; \
 				FOUND_ISSUES=1; \
 			fi; \
+			SERVICE_MATCHES=$$(grep -n "\.amazonaws\.com" "$$template" | grep -v "\$${AWS::URLSuffix}" | grep -v "^[[:space:]]*#" | grep -v "Description:" | grep -v "Comment:" | grep -v "cognito" | grep -v "ContentSecurityPolicy" || true); \
+			if [ -n "$$SERVICE_MATCHES" ]; then \
+				echo -e "$(RED)ERROR: Found hardcoded service principal references in $$template:$(NC)"; \
+				echo "$$SERVICE_MATCHES" | sed 's/^/ /'; \
+				echo -e "$(YELLOW) These should use '\$${AWS::URLSuffix}' instead of 'amazonaws.com' for GovCloud compatibility$(NC)"; \
+				echo -e "$(YELLOW) Example: 'lambda.amazonaws.com' should be 'lambda.\$${AWS::URLSuffix}'$(NC)"; \
+				FOUND_ISSUES=1; \
+			fi; \
 		fi; \
 	done; \
 	if [ $$FOUND_ISSUES -eq 0 ]; then \
-		echo -e "$(GREEN)✅ No hardcoded ARN partition references found!$(NC)"; \
+		echo -e "$(GREEN)✅ No hardcoded ARN partition or service principal references found!$(NC)"; \
 	else \
-		echo -e "$(RED)❌ Found hardcoded ARN partition references that need to be fixed$(NC)"; \
+		echo -e "$(RED)❌ Found hardcoded references that need to be fixed for GovCloud compatibility$(NC)"; \
 		exit 1; \
 	fi
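For illustration, here is the kind of template line the extended check flags versus accepts; the bucket name is hypothetical, and the compliant forms follow the `arn:${AWS::Partition}:` and `${AWS::URLSuffix}` guidance printed by the target above:

```yaml
# Flagged by check-arn-partitions: hardcoded partition and service domain
#   Resource: arn:aws:s3:::example-bucket/*
#   Service: lambda.amazonaws.com
# Accepted: partition- and suffix-aware forms
Resource: !Sub 'arn:${AWS::Partition}:s3:::example-bucket/*'
Service: !Sub 'lambda.${AWS::URLSuffix}'
```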

README.md

Lines changed: 1 addition & 0 deletions
@@ -128,6 +128,7 @@ For detailed deployment and testing instructions, see the [Deployment Guide](./d
 - [Agent Analysis](./docs/agent-analysis.md) - Natural language analytics and data visualization feature
 - [Custom MCP Agent](./docs/custom-MCP-agent.md) - Integrating external MCP servers for custom tools and capabilities
 - [Configuration](./docs/configuration.md) - Configuration and customization options
+- [Discovery](./docs/discovery.md) - Pattern-neutral discovery process and BDA blueprint automation
 - [Classification](./docs/classification.md) - Customizing document classification
 - [Extraction](./docs/extraction.md) - Customizing information extraction
 - [Human-in-the-Loop Review](./docs/human-review.md) - Human review workflows with Amazon A2I

VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-0.3.14
+0.3.16

config_library/pattern-1/lending-package-sample/config.yaml

Lines changed: 102 additions & 0 deletions
@@ -215,3 +215,105 @@ pricing:
         price: '1.5E-6'
       - name: cacheWriteInputTokens
         price: '1.875E-5'
+discovery:
+  output_format:
+    sample_json: |-
+      {
+        "document_class" : "Form-1040",
+        "document_description" : "Brief summary of the document",
+        "groups" : [
+          {
+            "name" : "PersonalInformation",
+            "description" : "Personal information of Tax payer",
+            "attributeType" : "group",
+            "groupAttributes" : [
+              {
+                "name": "FirstName",
+                "dataType" : "string",
+                "description" : "First Name of Taxpayer"
+              },
+              {
+                "name": "Age",
+                "dataType" : "number",
+                "description" : "Age of Taxpayer"
+              }
+            ]
+          },
+          {
+            "name" : "Dependents",
+            "description" : "Dependents of taxpayer",
+            "attributeType" : "list",
+            "listItemTemplate": {
+              "itemAttributes" : [
+                {
+                  "name": "FirstName",
+                  "dataType" : "string",
+                  "description" : "Dependent first name"
+                },
+                {
+                  "name": "Age",
+                  "dataType" : "number",
+                  "description" : "Dependent Age"
+                }
+              ]
+            }
+          }
+        ]
+      }
+  with_ground_truth:
+    top_p: '0.1'
+    temperature: '1.0'
+    user_prompt: >-
+      This image contains unstructured data. Analyze the data line by line using the provided ground truth as reference.
+      <GROUND_TRUTH_REFERENCE>
+      {ground_truth_json}
+      </GROUND_TRUTH_REFERENCE>
+      Ground truth reference JSON has the fields we are interested in extracting from the document/image. Use the ground truth to optimize field extraction. Match field names, data types, and groupings from the reference.
+      Image may contain multiple pages, process all pages.
+      Extract all field names including those without values.
+      Do not change the group name and field name from ground truth in the extracted data json.
+      Add field_description field for every field which will contain instruction to LLM to extract the field data from the image/document. Add data_type field for every field.
+      Add two fields document_class and document_description.
+      For document_class generate a short name based on the document content like W4, I-9, Paystub.
+      For document_description generate a description about the document in less than 50 words.
+      If the group repeats and follows table format, update the attributeType as "list".
+      Do not extract the values.
+      Format the extracted data using the below JSON format:
+      Format the extracted groups and fields using the below JSON format:
+
+    model_id: us.amazon.nova-pro-v1:0
+    system_prompt: >-
+      You are an expert in processing forms. Extracting data from images and
+      documents. Use provided ground truth data as reference to optimize field
+      extraction and ensure consistency with expected document structure and
+      field definitions.
+    max_tokens: '10000'
+  without_ground_truth:
+    top_p: '0.1'
+    temperature: '1.0'
+    user_prompt: >-
+      This image contains forms data. Analyze the form line by line.
+      Image may contains multiple pages, process all the pages.
+      Form may contain multiple name value pair in one line.
+      Extract all the names in the form including the name value pair which doesn't have value.
+      Organize them into groups, extract field_name, data_type and field description
+      Field_name should be less than 60 characters, should not have space use '-' instead of space.
+      field_description is a brief description of the field and the location of the field like box number or line number in the form and section of the form.
+      Field_name should be unique within the group.
+      Add two fields document_class and document_description.
+      For document_class generate a short name based on the document content like W4, I-9, Paystub.
+      For document_description generate a description about the document in less than 50 words.
+
+      Group the fields based on the section they are grouped in the form. Group should have attributeType as "group".
+      If the group repeats and follows table format, update the attributeType as "list".
+      Do not extract the values.
+      Return the extracted data in JSON format.
+      Format the extracted data using the below JSON format:
+      Format the extracted groups and fields using the below JSON format:
+    model_id: us.amazon.nova-pro-v1:0
+    system_prompt: >-
+      You are an expert in processing forms. Extracting data from images and
+      documents. Analyze forms line by line to identify field names, data types,
+      and organizational structure. Focus on creating comprehensive blueprints
+      for document processing without extracting actual values.
+    max_tokens: '10000'
