v0.3.15

@rstrahan rstrahan released this 12 Sep 21:46
· 511 commits to main since this release

[0.3.15]

Added

  • Intelligent Document Discovery Module for Automated Configuration Generation

    • Added Discovery module that automatically analyzes document samples to identify structure, field types, and organizational patterns
    • Pattern-Neutral Design: Works across all processing patterns (1, 2, 3) with unified discovery process and pattern-specific implementations
    • Dual Discovery Methods: Discovery without ground truth (exploratory analysis) and with ground truth (optimization using labeled data)
    • Automated Blueprint Creation: Pattern 1 includes zero-touch BDA blueprint generation with intelligent change detection and version management
    • Web UI Integration: Real-time discovery job monitoring, interactive results review, and seamless configuration integration
    • Advanced Features: Multi-model support (Nova, Claude), customizable prompts, configurable parameters, ground truth processing, schema conversion, and lifecycle management
    • Key Benefits: Rapid new document type onboarding, reduced time-to-production, configuration optimization, and automated workflow bootstrapping
    • Use Cases: New document exploration, configuration improvement, rapid prototyping, and document understanding
    • Documentation: Guide in docs/discovery.md with architecture details, best practices, and troubleshooting
  • Optional Pattern-2 Regex-Based Classification for Enhanced Performance

    • Added support for optional regex patterns in document class definitions for performance optimization
    • Document Name Regex: Match against document ID/name to classify all pages without LLM processing when all pages should be the same class
    • Document Page Content Regex: Match against page text content during multi-modal page-level classification for fast page classification
    • Key Benefits: Lower latency and cost by bypassing LLM calls for pattern-matched documents; deterministic classification results for known document patterns; seamless fallback to existing LLM classification when regex patterns don't match
    • Configuration: Optional document_name_regex and document_page_content_regex fields in class definitions with automatic regex compilation and validation
    • Logging: Comprehensive info-level logging when regex patterns match for observability and debugging
    • CloudFormation Integration: Updated Pattern-2 schema to support regex configuration through the Web UI
    • Demonstration: New step2_classification_with_regex.ipynb notebook showcasing regex configuration and performance comparisons
    • Documentation: Enhanced classification module README and main documentation with regex usage examples and best practices
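
The regex shortcut described above can be sketched as follows. This is a minimal illustration, not the module's actual code: only the field names document_name_regex and document_page_content_regex come from these notes. The class definitions, the function names, the use of re.search, and the llm_classify callable are all assumptions made for the example.

```python
import re
from typing import Callable, Optional

# Hypothetical class definitions; only the two regex field names are from the release notes.
CLASSES = [
    {
        "name": "Payslip",
        "document_name_regex": r"(?i).*payslip.*",
        "document_page_content_regex": r"(?i)\b(gross pay|net pay)\b",
    },
    {"name": "Letter"},  # no regex fields: always falls through to the LLM
]


def classify_document_by_name(document_id: str, classes: list[dict]) -> Optional[str]:
    """Return a class name if the document ID matches a document_name_regex, else None."""
    for cls in classes:
        pattern = cls.get("document_name_regex")
        if pattern and re.search(pattern, document_id):
            return cls["name"]
    return None


def classify_page_by_content(page_text: str, classes: list[dict]) -> Optional[str]:
    """Return a class name if the page text matches a document_page_content_regex, else None."""
    for cls in classes:
        pattern = cls.get("document_page_content_regex")
        if pattern and re.search(pattern, page_text):
            return cls["name"]
    return None


def classify_page(
    document_id: str,
    page_text: str,
    classes: list[dict],
    llm_classify: Callable[[str], str],
) -> str:
    """Try the regex shortcuts first; call the LLM only when nothing matches."""
    return (
        classify_document_by_name(document_id, classes)
        or classify_page_by_content(page_text, classes)
        or llm_classify(page_text)
    )
```

With these definitions, a document named 2024-payslip-jan.pdf would be classified as Payslip without any LLM call, while a page with no regex match is passed to llm_classify unchanged.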
  • Windows WSL Development Environment Setup Guide

    • Added WSL-based development environment setup guide for Windows developers in docs/setup-development-env-WSL.md
    • Key Features: Automated setup script (wsl_setup.sh) for quick installation of Git, Python, Node.js, AWS CLI, and SAM CLI
    • Integrated Workflow: Development setup combining Windows tools (VS Code, browsers) with native Linux environment
    • Target Use Cases: Windows developers needing Linux compatibility without Docker Desktop or VM overhead

Fixed

  • Throttling Error Detection and Retry Logic for Assessment Functions - GitHub Issue #45

    • Assessment Function: Enhanced throttling detection to check for throttling errors returned in document.errors field in addition to thrown exceptions, raising ThrottlingException to trigger Step Functions retry when throttling is detected
    • Granular Assessment Task Caching: Fixed caching logic to properly cache successful assessment tasks when there are ANY failed tasks (both exception-based and result-based failures), enabling efficient retry optimization by only reprocessing failed tasks while preserving successful results
    • Impact: Improved resilience for throttling scenarios, reduced redundant processing during retries, and better Step Functions retry behavior
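
The result-based throttling check described above might look roughly like this. It is a sketch under stated assumptions: the Document stand-in, the marker strings, and the helper names are hypothetical. The notes confirm only that errors in the document.errors field are inspected and that a ThrottlingException is raised to trigger the Step Functions retry.

```python
from dataclasses import dataclass, field


class ThrottlingException(Exception):
    """Raised so that a retry policy matching this error name can retry the task."""


# Assumed substrings that indicate throttling; the real detection list may differ.
THROTTLING_MARKERS = ("throttlingexception", "too many requests", "rate exceeded")


@dataclass
class Document:
    # Stand-in for the project's document object; only the errors field is from the notes.
    errors: list[str] = field(default_factory=list)


def is_throttling_message(message: str) -> bool:
    """Case-insensitive check for any assumed throttling marker."""
    lowered = message.lower()
    return any(marker in lowered for marker in THROTTLING_MARKERS)


def raise_if_throttled(document: Document) -> None:
    """Raise ThrottlingException when any recorded error looks like throttling."""
    for message in document.errors:
        if is_throttling_message(message):
            raise ThrottlingException(message)
```

Calling raise_if_throttled after the assessment step surfaces throttling that was recorded as a result rather than thrown, so both failure paths reach the same retry handling.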
  • Security Vulnerability Mitigation - Package Updates

  • GovCloud Compatibility - Hardcoded Service Domain References

    • Fixed hardcoded amazonaws.com references in CloudFormation templates that prevented GovCloud deployment
    • Updated all service principals and endpoints to use dynamic ${AWS::URLSuffix} expressions for automatic region-based resolution
    • Templates Updated: template.yaml (main template), patterns/pattern-3/sagemaker_classifier_endpoint.yaml
    • Services Fixed: EventBridge, Cognito, SageMaker, ECR, CloudFront, CodeBuild, AppSync, Lambda, DynamoDB, CloudWatch Logs, Glue
    • Resolves GitHub Issue #50 - templates now deploy correctly in both standard AWS and GovCloud regions
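
A small check along these lines could help catch regressions of this fix. It is an illustrative sketch, not part of the release: the function name and the sample template text are assumptions, and the scan simply flags any line that still contains a literal amazonaws.com.

```python
import re

# Lines that use ${AWS::URLSuffix} contain no literal "amazonaws.com", so they are not flagged.
HARDCODED_SUFFIX = re.compile(r"amazonaws\.com")

# Hypothetical template excerpt used only to demonstrate the scan.
SAMPLE_TEMPLATE = """\
Principal:
  Service: lambda.amazonaws.com
Principal:
  Service: !Sub "states.${AWS::URLSuffix}"
"""


def find_hardcoded_suffixes(template_text: str) -> list[tuple[int, str]]:
    """Return (line_number, stripped_line) pairs that still hardcode amazonaws.com."""
    hits = []
    for number, line in enumerate(template_text.splitlines(), start=1):
        if HARDCODED_SUFFIX.search(line):
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    for number, line in find_hardcoded_suffixes(SAMPLE_TEMPLATE):
        print(f"line {number}: {line}")
```

On the sample excerpt, only the second line is reported; the line using the ${AWS::URLSuffix} expression passes.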
  • Bug Fixes and Code Improvements

    • Fixed HITL processing errors in both Pattern-1 (DynamoDB validation with empty strings) and Pattern-2 (string indices error in A2I output processing)
    • Fixed Step Function UI issues including auto-refresh button auto-disable and fetch failures for failed executions with datetime serialization errors
    • Cleaned up unused Step Function subscription infrastructure and removed duplicate code in Pattern-2 HITL function
    • Expanded UI Visual Editor bounding box size with padding for better visibility and user interaction
    • Fixed a bug in the list of models supporting cache points: Claude 4 Sonnet and Claude 4 Opus were previously excluded.
    • Added validation at the assessment step to check that the response is valid JSON. If JSON parsing issues are encountered, the validation fails after extraction/assessment completes.
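
The JSON validation in the last item could be sketched as below. This is a hedged illustration only: the exception type, the function name, and the requirement that the top level be a JSON object are assumptions, not details confirmed by these notes.

```python
import json


class InvalidJsonResponseError(ValueError):
    """Hypothetical error type for a model response that is not valid JSON."""


def parse_json_response(raw_text: str) -> dict:
    """Parse a model response, raising a clear error when it is not a JSON object."""
    try:
        parsed = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        raise InvalidJsonResponseError(f"Response is not valid JSON: {exc.msg}") from exc
    # Assumption for this sketch: a usable response is a JSON object, not a list or scalar.
    if not isinstance(parsed, dict):
        raise InvalidJsonResponseError("Expected a JSON object at the top level")
    return parsed
```

Raising a dedicated error type keeps parsing failures distinguishable from other assessment failures when results are reviewed.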