|
2 | 2 |
|
3 | 3 | ## Current Task Focus |
4 | 4 |
|
5 | | -**User Question**: Understanding OCR processing architecture for large PDFs (500+ pages) in the IDP accelerator, specifically: |
6 | | -1. Is OCR processing sequential or distributed by page? |
7 | | -2. How does Bedrock-only OCR deployment differ? |
8 | | -3. What parts of the system run sequentially vs distributed? |
9 | | -4. Handling massive PDFs with hundreds of forms without clear page boundaries |
| 5 | +**Customer Question**: "We are encountering difficulties deploying your IDP stack outside of a sandbox environment due to an organization-wide Service Control Policy (SCP). This policy mandates the attachment of a Permissions Boundary to any new role. Could you please inform us if it is possible to update the CloudFormation template to include a parameterized Permissions Boundary? Without this update, our ability to transition the code to production will be significantly impeded." |
10 | 6 |
|
11 | | -## Key Findings |
| 7 | +**Task Status**: Implementation phase - Need to add Permissions Boundary parameter support to CloudFormation templates |
12 | 8 |
|
13 | | -### OCR Processing Models |
| 9 | +## Problem Analysis |
14 | 10 |
|
15 | | -The IDP accelerator uses **different processing models depending on the pattern**: |
| 11 | +### Current Situation |
| 12 | +- IDP stack creates numerous IAM roles across main template and pattern templates |
| 13 | +- Organization has SCP requiring Permissions Boundary on all new IAM roles |
| 14 | +- Current templates don't support Permissions Boundary configuration |
| 15 | +- Blocking production deployment |
16 | 16 |
|
17 | | -#### Pattern 1 (BDA): Sequential Internal Processing |
18 | | -- **OCR Approach**: Bedrock Data Automation handles everything internally |
19 | | -- **Processing**: Entire document processed as single unit by BDA service |
20 | | -- **Concurrency**: Not user-controllable, managed by BDA |
21 | | -- **Large Documents**: Subject to BDA service limits and timeouts |
| 17 | +### Affected Templates |
| 18 | +- **Main Template**: `template.yaml` - ~15 IAM roles |
| 19 | +- **Pattern 1**: `patterns/pattern-1/template.yaml` - ~8 IAM roles |
| 20 | +- **Pattern 2**: `patterns/pattern-2/template.yaml` - ~6 IAM roles |
| 21 | +- **Pattern 3**: `patterns/pattern-3/template.yaml` - ~5 IAM roles |
| 22 | +- **Options**: `options/bda-lending-project/template.yaml`, `options/bedrockkb/template.yaml` |
22 | 23 |
|
23 | | -#### Pattern 2/3 (Textract + Bedrock): Distributed Page Processing |
24 | | -- **OCR Approach**: AWS Textract with concurrent page processing |
25 | | -- **Processing**: **Pages processed in parallel** using ThreadPoolExecutor |
26 | | -- **Concurrency**: Configurable (default: 20 concurrent workers) |
27 | | -- **Large Documents**: Optimal for 500+ page documents |
| 24 | +## Solution Design |
28 | 25 |
|
29 | | -### Sequential vs Distributed Components |
| 26 | +### Approach: Parameterized Permissions Boundary |
| 27 | +1. **Add optional parameter** to main template for Permissions Boundary ARN |
| 28 | +2. **Conditionally apply boundary** to all IAM roles when provided |
| 29 | +3. **Maintain backward compatibility** for deployments without boundaries |
| 30 | +4. **Cascade parameter** to all nested pattern stacks |
30 | 31 |
|
31 | | -#### Sequential Processing: |
32 | | -1. **Step Functions Workflow**: OCR → Classification → Extraction → Assessment → Summarization |
33 | | -2. **Classification**: Analyzes all pages to create document boundaries |
34 | | -3. **BDA Internal Processing**: Everything handled as single unit |
| 32 | +### Implementation Plan |
35 | 33 |
|
36 | | -#### Distributed Processing: |
37 | | -1. **OCR Pages (Pattern 2/3)**: Up to 20 pages processed simultaneously |
38 | | -2. **Extraction Sections**: Up to 10 document sections processed in parallel |
39 | | -3. **Independent API Calls**: Each page makes separate Textract calls |
| 34 | +#### Step 1: Main Template Updates (`template.yaml`) |
| 35 | +- Add `PermissionsBoundaryArn` parameter |
| 36 | +- Add `HasPermissionsBoundary` condition |
| 37 | +- Update all IAM role resources with conditional boundary |
| 38 | +- Pass parameter to nested stacks |
| 39 | +- Update CloudFormation interface metadata |
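
A minimal sketch of the last two bullets of Step 1, assuming the main template declares each pattern as an `AWS::CloudFormation::Stack` resource. The logical ID `PatternStack`, the `TemplateURL` value, and the parameter group label are placeholders for illustration, not names taken from the actual templates.

```yaml
# Fragment of the main template (template.yaml); merge into the existing sections.
Metadata:
  AWS::CloudFormation::Interface:
    ParameterGroups:
      - Label:
          default: IAM Configuration    # hypothetical group label
        Parameters:
          - PermissionsBoundaryArn

Resources:
  PatternStack:                          # hypothetical logical ID
    Type: AWS::CloudFormation::Stack
    Properties:
      TemplateURL: ./patterns/pattern-2/template.yaml  # assumed path, resolved at package time
      Parameters:
        # Forward the value unchanged; an empty string means "no boundary".
        PermissionsBoundaryArn: !Ref PermissionsBoundaryArn
```

Forwarding the raw string, rather than a resolved condition, lets each nested template decide for itself whether to attach the boundary.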
40 | 40 |
|
41 | | -## Customer Scenario Analysis |
| 41 | +#### Step 2: Pattern Template Updates |
| 42 | +- Add parameter to each pattern template |
| 43 | +- Update all IAM roles in patterns |
| 44 | +- Maintain consistency across all patterns |
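
CloudFormation conditions are scoped to a single template, so a nested pattern template cannot reuse the parent's `HasPermissionsBoundary` condition. It has to declare its own parameter and condition. A sketch of the child side, with `ExampleFunctionRole` as a hypothetical role name and a Lambda trust policy assumed only for illustration:

```yaml
# Fragment of a pattern template, e.g. patterns/pattern-2/template.yaml
Parameters:
  PermissionsBoundaryArn:
    Type: String
    Default: ""
    Description: (Optional) ARN of IAM Permissions Boundary policy, passed in by the main stack

Conditions:
  # Redeclared here because conditions do not cross stack boundaries.
  HasPermissionsBoundary: !Not [!Equals [!Ref PermissionsBoundaryArn, ""]]

Resources:
  ExampleFunctionRole:                   # hypothetical role
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      # Omitted entirely when no boundary ARN is supplied.
      PermissionsBoundary: !If [HasPermissionsBoundary, !Ref PermissionsBoundaryArn, !Ref AWS::NoValue]
```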
42 | 45 |
|
43 | | -### 500+ Page PDF with Multiple Forms |
| 46 | +#### Step 3: Options Template Updates |
| 47 | +- Update BDA lending project template |
| 48 | +- Update Bedrock KB template |
44 | 49 |
|
45 | | -**Challenge**: Single PDF containing hundreds of forms without clear page boundaries |
| 50 | +### Key Implementation Details |
46 | 51 |
|
47 | | -**Recommended Approach**: Pattern 2 or 3 for optimal performance |
48 | | - |
49 | | -**Why Pattern 2/3 is Better**: |
50 | | -- **Page-Level Parallelism**: 500 pages processed 20 at a time |
51 | | -- **Memory Efficiency**: Individual pages loaded, not entire document |
52 | | -- **Fault Tolerance**: Page failures don't stop entire processing |
53 | | -- **Granular Control**: Can optimize per-page processing |
54 | | - |
55 | | -**Classification Strategy**: |
56 | | -- Use "holistic" classification method to analyze entire document |
57 | | -- Creates logical sections grouping related pages |
58 | | -- Handles form boundaries that don't align with page boundaries |
59 | | - |
60 | | -## Technical Implementation Details |
| 52 | +**Parameter Definition:** |
| 53 | +```yaml |
| 54 | +PermissionsBoundaryArn: |
| 55 | +  Type: String |
| 56 | +  Default: "" |
| 57 | +  Description: (Optional) ARN of IAM Permissions Boundary policy |
| 58 | +  AllowedPattern: "^(|arn:aws[a-z-]*:iam::([0-9]{12}|aws):policy/.+)$" |
| 59 | +``` |
61 | 60 |
|
62 | | -### OCR Service Configuration for Large Documents |
| 61 | +**Condition:** |
| 62 | +```yaml |
| 63 | +HasPermissionsBoundary: !Not [!Equals [!Ref PermissionsBoundaryArn, ""]] |
| 64 | +``` |
63 | 65 |
|
| 66 | +**Role Update Pattern:** |
64 | 67 | ```yaml |
65 | | -ocr: |
66 | | -  backend: "textract" |
67 | | -  max_workers: 20 # Increase for more parallelism |
68 | | -  image: |
69 | | -    dpi: 150 # Balance quality vs processing time |
70 | | -    target_width: 1024 |
71 | | -    target_height: 1024 |
72 | | -  features: |
73 | | -    - name: "LAYOUT" |
74 | | -    - name: "TABLES" |
75 | | -    - name: "FORMS" |
| 68 | +SomeRole: |
| 69 | +  Type: AWS::IAM::Role |
| 70 | +  Properties: |
| 71 | +    # existing properties... |
| 72 | +    PermissionsBoundary: !If [HasPermissionsBoundary, !Ref PermissionsBoundaryArn, !Ref AWS::NoValue] |
76 | 73 | ``` |
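
Implicitly generated roles may need separate handling. If any template uses the AWS SAM transform (an assumption; these notes do not say), a function declared as `AWS::Serverless::Function` without an explicit `Role` gets an auto-generated execution role that the role update pattern does not reach. SAM exposes a `PermissionsBoundary` property on the function for this case. `ExampleFunction` and its handler settings below are hypothetical:

```yaml
# Assumes the PermissionsBoundaryArn parameter and HasPermissionsBoundary
# condition defined above are present in the same template.
Transform: AWS::Serverless-2016-10-31

Resources:
  ExampleFunction:                       # hypothetical function
    Type: AWS::Serverless::Function
    Properties:
      Handler: index.handler
      Runtime: python3.12
      CodeUri: ./src/example/
      # Applied by SAM to the execution role it generates for this function.
      PermissionsBoundary: !If [HasPermissionsBoundary, !Ref PermissionsBoundaryArn, !Ref AWS::NoValue]
```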
77 | 74 |
|
78 | | -### Processing Flow for Large PDFs |
79 | | -
|
80 | | -1. **Document Load**: PyMuPDF loads PDF structure |
81 | | -2. **Page Distribution**: ThreadPoolExecutor creates 20 concurrent workers |
82 | | -3. **Parallel OCR**: Each page processed independently via Textract |
83 | | -4. **Result Assembly**: Pages sorted and combined into document structure |
84 | | -5. **Classification**: Holistic analysis creates logical document sections |
85 | | -6. **Parallel Extraction**: Sections processed concurrently (MaxConcurrency: 10) |
86 | | -
|
87 | | -## Performance Implications |
88 | | -
|
89 | | -### For 500-Page Document: |
90 | | -- **Pattern 1 (BDA)**: Single job, BDA-managed processing |
91 | | -- **Pattern 2/3**: ~25 batches of 20 pages each, highly parallelized |
92 | | -
|
93 | | -### Bottlenecks to Consider: |
94 | | -1. **Textract Rate Limits**: May need to adjust max_workers |
95 | | -2. **Memory Usage**: 20 concurrent pages require significant memory |
96 | | -3. **S3 Operations**: Parallel uploads/downloads for page results |
97 | | -4. **Lambda Timeouts**: Ensure sufficient timeout for large documents |
98 | | -
|
99 | | -## Next Steps and Considerations |
100 | | -
|
101 | | -### For Customer Implementation: |
102 | | -1. **Choose Pattern 2 or 3** for large document processing |
103 | | -2. **Configure max_workers** based on Textract limits and memory |
104 | | -3. **Use holistic classification** to handle form boundaries |
105 | | -4. **Monitor memory usage** during processing |
106 | | -5. **Consider document splitting** if single PDF approach is problematic |
107 | | -
|
108 | | -### Optimization Opportunities: |
109 | | -- **Adaptive Concurrency**: Adjust workers based on document size |
110 | | -- **Progressive Processing**: Start classification while OCR continues |
111 | | -- **Caching Strategy**: Cache page images for reprocessing |
112 | | -- **Error Recovery**: Implement page-level retry with exponential backoff |
| 75 | +## Benefits |
| 76 | +- **SCP Compliance**: Satisfies organizational requirements |
| 77 | +- **Backward Compatible**: Existing deployments unaffected |
| 78 | +- **Flexible**: Works with any Permissions Boundary policy |
| 79 | +- **Comprehensive**: Covers all IAM roles across all components |
| 80 | +
|
| 81 | +## Next Steps |
| 82 | +1. Implement main template changes |
| 83 | +2. Update all pattern templates |
| 84 | +3. Update options templates |
| 85 | +4. Test deployment scenarios |
| 86 | +5. Document usage examples |