aws-solutions-library-samples
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 14 additions & 2 deletions
diff --git a/docs/reporting-database.md b/docs/reporting-database.md
Lines changed: 12 additions & 2 deletions
diff --git a/lib/idp_common_pkg/idp_common/reporting/README.md b/lib/idp_common_pkg/idp_common/reporting/README.md
Lines changed: 14 additions & 4 deletions
diff --git a/lib/idp_common_pkg/idp_common/reporting/save_reporting_data.py b/lib/idp_common_pkg/idp_common/reporting/save_reporting_data.py
Lines changed: 180 additions & 3 deletions
@@ -5,14 +5,26 @@ SPDX-License-Identifier: MIT-0
 
 ## [Unreleased]
 
+## [0.3.10]
+
+### Added
+
+- **Automatic Glue Table Creation for Document Sections**
+  - Added automatic creation of AWS Glue tables for each document section type (classification) during processing
+  - Tables are created dynamically when new section types are encountered, eliminating manual table creation
+  - Consistent lowercase naming convention for tables ensures compatibility with case-sensitive S3 paths
+  - Tables are configured with partition projection for efficient date-based queries without manual partition management
+  - Automatic schema evolution - tables update when new fields are detected in extraction results
+
+
+## [0.3.9]
+
 ### Added
 - **Optional Permissions Boundary Support for Enterprise Deployments**
   - Added `PermissionsBoundaryArn` parameter to all CloudFormation templates for organizations with Service Control Policies (SCPs) requiring permissions boundaries
   - Comprehensive support for both explicit IAM roles and implicit roles created by AWS SAM functions and state machines
   - Conditional implementation ensures backward compatibility - when no permissions boundary is provided, roles deploy normally
 
-## [0.3.8]
-
 ### Added
 - IDP Configuration and Prompting Best Practices documentation [doc](./docs/idp-configuration-best-practices.md)
 
 
@@ -107,11 +107,21 @@ The metering table is particularly valuable for:
 
 ## Document Sections Tables
 
-The document sections tables store the actual extracted data from document sections in a structured format suitable for analytics. These tables are automatically discovered by AWS Glue Crawler and are organized by section type (classification).
+The document sections tables store the actual extracted data from document sections in a structured format suitable for analytics. These tables are automatically created when new section types are encountered during document processing, eliminating the need for manual table creation.
+
+### Automatic Table Creation
+
+When a document is processed and a new section type (classification) is detected, the system automatically:
+1. Creates a new Glue table for that section type (e.g., `document_sections_invoice`, `document_sections_receipt`, `document_sections_w2`)
+2. Configures the table with appropriate schema based on the extracted data
+3. Sets up partition projection for efficient date-based queries
+4. Updates the table schema if new fields are detected in subsequent documents
+
+**Important:** Section type names are normalized to lowercase for consistency with case-sensitive S3 paths. For example, a section classified as "W2" will create a table named `document_sections_w2` with data stored in `document_sections/w2/`.
 
 ### Dynamic Section Tables
 
-Document sections are stored in dynamically created tables based on the section classification. Each section type gets its own table (e.g., `document_sections_invoice`, `document_sections_receipt`, `document_sections_bank_statement`, etc.) with the following characteristics:
+Document sections are stored in dynamically created tables based on the section classification. Each section type gets its own table with the following characteristics:
 
 **Common Metadata Columns:**
 | Column | Type | Description |
 
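The lowercase normalization described in this hunk can be sketched as a small helper. This is an illustration only: the function name `section_table_names` is hypothetical, and the regular expression is the one applied in `save_reporting_data.py` in this change.

```python
import re
from typing import Tuple


def section_table_names(section_type: str) -> Tuple[str, str]:
    """Return (Glue table name, S3 prefix) for a section classification.

    Hypothetical helper for illustration; it mirrors the escaping applied in
    save_reporting_data.py (lowercase, then replace filesystem-unsafe characters).
    """
    escaped = re.sub(r"[/\\:*?\"<>|]", "_", section_type.lower())
    return f"document_sections_{escaped}", f"document_sections/{escaped}/"


print(section_table_names("W2"))
```

Note that the pattern leaves spaces and hyphens untouched, so a classification such as `Bank Statement` would yield a table name that contains a space.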
@@ -199,11 +199,21 @@ The reporting module is designed to work seamlessly with AWS Glue and Amazon Ath
 
 ### Automatic Table Creation
 
-AWS Glue tables are automatically created via CloudFormation with the following features:
+The reporting module provides two types of automatic table creation:
 
-- **Predefined Tables**: `document_evaluations`, `section_evaluations`, `attribute_evaluations`, and `metering` tables
-- **Dynamic Tables**: Document sections tables are automatically discovered by AWS Glue Crawler
-- **Partition Projection**: All tables use partition projection for efficient querying
+#### Predefined Tables (CloudFormation)
+- **Evaluation Tables**: `document_evaluations`, `section_evaluations`, `attribute_evaluations`
+- **Metering Table**: `metering`
+- Created during stack deployment via CloudFormation
+
+#### Dynamic Section Tables (Runtime)
+When processing documents with new section types, the `SaveReportingData` class automatically:
+- **Creates New Tables**: Generates a Glue table for each unique section type (e.g., `document_sections_invoice`, `document_sections_w2`)
+- **Updates Schemas**: Adds new columns when new fields are detected in extraction results
+- **Configures Partitions**: Sets up partition projection for efficient date-based queries
+- **Normalizes Names**: Converts section types to lowercase for S3 path consistency (e.g., "W2" → "w2")
+
+This automatic table creation eliminates manual table management and ensures data is immediately queryable in Athena.
 
 ### Partition Projection Configuration
 
 
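For reference, the partition projection settings applied by the runtime path can be sketched as a plain dictionary builder. The helper name `projection_parameters` and the bucket name are hypothetical; the keys and values are copied from the `table_input` in `save_reporting_data.py` in this change.

```python
from typing import Dict


def projection_parameters(bucket: str, escaped_section_type: str) -> Dict[str, str]:
    """Build Glue table parameters that enable date-based partition projection."""
    location = f"s3://{bucket}/document_sections/{escaped_section_type}/"
    return {
        "classification": "parquet",
        "projection.enabled": "true",
        "projection.date.type": "date",
        "projection.date.format": "yyyy-MM-dd",
        "projection.date.range": "2024-01-01,2030-12-31",
        "projection.date.interval": "1",
        "projection.date.interval.unit": "DAYS",
        # ${date} is a literal placeholder that Athena substitutes at query time
        "storage.location.template": location + "date=${date}/",
    }


# "example-reporting-bucket" is a placeholder bucket name
params = projection_parameters("example-reporting-bucket", "w2")
print(params["storage.location.template"])
```

Because the range is fixed at `2024-01-01,2030-12-31`, partitions dated outside it would presumably not be projected, so a wider or relative range may be worth considering.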
@@ -32,15 +32,18 @@ class SaveReportingData:
     to a reporting bucket in Parquet format for analytics.
     """
 
-    def __init__(self, reporting_bucket: str):
+    def __init__(self, reporting_bucket: str, database_name: Optional[str] = None):
         """
         Initialize the SaveReportingData class.
 
         Args:
             reporting_bucket: S3 bucket name for reporting data
+            database_name: Glue database name for creating tables (optional)
         """
         self.reporting_bucket = reporting_bucket
+        self.database_name = database_name
         self.s3_client = boto3.client("s3")
+        self.glue_client = boto3.client("glue") if database_name else None
 
     def _serialize_value(self, value: Any) -> str:
         """
@@ -305,6 +308,165 @@ def _sanitize_records_for_schema(
 
         return sanitized_records
 
+    def _convert_schema_to_glue_columns(
+        self, schema: pa.Schema
+    ) -> List[Dict[str, str]]:
+        """
+        Convert PyArrow schema to Glue table columns format.
+
+        Args:
+            schema: PyArrow schema
+
+        Returns:
+            List of column definitions for Glue
+        """
+        columns = []
+        for field in schema:
+            # Map PyArrow types to Glue/Hive types
+            if field.type == pa.string():
+                glue_type = "string"
+            elif field.type == pa.bool_():
+                glue_type = "boolean"
+            elif field.type == pa.int64():
+                glue_type = "bigint"
+            elif field.type == pa.int32():
+                glue_type = "int"
+            elif field.type == pa.float64():
+                glue_type = "double"
+            elif field.type == pa.float32():
+                glue_type = "float"
+            elif field.type == pa.timestamp("ms"):
+                glue_type = "timestamp"
+            else:
+                # Default to string for unknown types
+                glue_type = "string"
+
+            columns.append({"Name": field.name, "Type": glue_type})
+
+        return columns
+
+    def _create_or_update_glue_table(
+        self, section_type: str, schema: pa.Schema, new_section_created: bool = False
+    ) -> bool:
+        """
+        Create or update a Glue table for a document section type.
+
+        Args:
+            section_type: The document section type (e.g., 'invoice', 'receipt')
+            schema: PyArrow schema for the table
+            new_section_created: Whether this is a new section type
+
+        Returns:
+            True if table was created or updated, False otherwise
+        """
+        if not self.glue_client or not self.database_name:
+            logger.debug(
+                "Glue client or database name not configured, skipping table creation"
+            )
+            return False
+
+        # Escape section_type to make it table-name-safe
+        escaped_section_type = re.sub(r"[/\\:*?\"<>|]", "_", section_type.lower())
+        table_name = f"document_sections_{escaped_section_type}"
+
+        # Convert schema to Glue columns
+        columns = self._convert_schema_to_glue_columns(schema)
+
+        # Table input for create/update
+        table_input = {
+            "Name": table_name,
+            "Description": f"Document sections table for type: {section_type}",
+            "StorageDescriptor": {
+                "Columns": columns,
+                "Location": f"s3://{self.reporting_bucket}/document_sections/{escaped_section_type}/",
+                "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
+                "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
+                "Compressed": True,
+                "SerdeInfo": {
+                    "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
+                },
+            },
+            "PartitionKeys": [{"Name": "date", "Type": "string"}],
+            "TableType": "EXTERNAL_TABLE",
+            "Parameters": {
+                "classification": "parquet",
+                "typeOfData": "file",
+                "projection.enabled": "true",
+                "projection.date.type": "date",
+                "projection.date.format": "yyyy-MM-dd",
+                "projection.date.range": "2024-01-01,2030-12-31",
+                "projection.date.interval": "1",
+                "projection.date.interval.unit": "DAYS",
+                "storage.location.template": f"s3://{self.reporting_bucket}/document_sections/{escaped_section_type}/date=${{date}}/",
+            },
+        }
+
+        try:
+            # Try to get the existing table
+            existing_table = self.glue_client.get_table(
+                DatabaseName=self.database_name, Name=table_name
+            )
+
+            # Check whether the incoming schema introduces new columns
+            existing_columns = (
+                existing_table.get("Table", {})
+                .get("StorageDescriptor", {})
+                .get("Columns", [])
+            )
+            existing_column_names = {col["Name"] for col in existing_columns}
+            new_column_names = {col["Name"] for col in columns}
+
+            # If there are new columns, update the table
+            if new_column_names - existing_column_names:
+                logger.info(f"Updating Glue table {table_name} with new columns")
+                self.glue_client.update_table(
+                    DatabaseName=self.database_name, TableInput=table_input
+                )
+                return True
+            else:
+                logger.debug(
+                    f"Glue table {table_name} already exists with current schema"
+                )
+                return False
+
+        except Exception as get_table_error:
+            # Check if it's an EntityNotFoundException or similar (table doesn't exist)
+            error_str = str(get_table_error)
+            if (
+                "EntityNotFoundException" in error_str
+                or "not found" in error_str.lower()
+            ):
+                # Table doesn't exist, create it
+                logger.info(
+                    f"Creating new Glue table {table_name} for section type: {section_type}"
+                )
+                try:
+                    self.glue_client.create_table(
+                        DatabaseName=self.database_name, TableInput=table_input
+                    )
+                    logger.info(f"Successfully created Glue table {table_name}")
+                    return True
+                except Exception as create_error:
+                    # Check if it's an AlreadyExistsException
+                    if "AlreadyExistsException" in str(create_error):
+                        logger.debug(
+                            f"Glue table {table_name} already exists (race condition)"
+                        )
+                        return False
+                    logger.error(
+                        f"Error creating Glue table {table_name}: {str(create_error)}"
+                    )
+                    return False
+            else:
+                # Some other error occurred
+                logger.error(
+                    f"Error checking Glue table {table_name}: {str(get_table_error)}"
+                )
+                return False
+        except Exception as e:
+            logger.error(f"Error checking/updating Glue table {table_name}: {str(e)}")
+            return False
+
     def save(self, document: Document, data_to_save: List[str]) -> List[Dict[str, Any]]:
         """
         Save document data based on the data_to_save list.
@@ -747,6 +909,7 @@ def save_document_sections(self, document: Document) -> Optional[Dict[str, Any]]
         sections_processed = 0
         sections_with_errors = 0
         total_records_saved = 0
+        section_types_processed = set()  # Track unique section types
 
         logger.info(
             f"Processing {len(document.sections)} sections for document {document_id}"
@@ -845,8 +1008,10 @@ def save_document_sections(self, document: Document) -> Optional[Dict[str, Any]]
                 section_type = (
                     section.classification if section.classification else "unknown"
                 )
-                # Escape section_type to make it filesystem-safe
-                escaped_section_type = re.sub(r"[/\\:*?\"<>|]", "_", section_type)
+                # Escape section_type to make it filesystem-safe and lowercase for consistency
+                escaped_section_type = re.sub(
+                    r"[/\\:*?\"<>|]", "_", section_type.lower()
+                )
 
                 s3_key = (
                     f"document_sections/"
@@ -866,6 +1031,18 @@ def save_document_sections(self, document: Document) -> Optional[Dict[str, Any]]
                     f"to s3://{self.reporting_bucket}/{s3_key}"
                 )
 
+                # Track this section type and create/update Glue table if needed
+                if escaped_section_type not in section_types_processed:
+                    section_types_processed.add(escaped_section_type)
+                    # Try to create or update the Glue table for this section type
+                    table_created = self._create_or_update_glue_table(
+                        section_type, schema
+                    )
+                    if table_created:
+                        logger.info(
+                            f"Created/updated Glue table for section type: {section_type}"
+                        )
+
             except Exception as e:
                 logger.error(f"Error processing section {section.section_id}: {str(e)}")
                 sections_with_errors += 1
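
A point that may be worth checking in `_create_or_update_glue_table`: `update_table` is called with a `TableInput` built only from the current document's schema. Assuming Glue replaces the column list wholesale on update, columns already in the table but absent from the current document could be dropped. A minimal sketch of merging the two lists instead follows; the helper name `merge_glue_columns` and the sample column names are hypothetical.

```python
from typing import Dict, List


def merge_glue_columns(
    existing: List[Dict[str, str]], incoming: List[Dict[str, str]]
) -> List[Dict[str, str]]:
    """Return existing columns followed by incoming columns not yet present.

    Existing definitions win on name clashes, so a column's type never changes.
    """
    seen = {col["Name"] for col in existing}
    merged = list(existing)
    for col in incoming:
        if col["Name"] not in seen:
            merged.append(col)
            seen.add(col["Name"])
    return merged


# Illustrative column definitions in the Glue {"Name", "Type"} format
existing = [{"Name": "section_id", "Type": "string"}, {"Name": "total", "Type": "double"}]
incoming = [{"Name": "section_id", "Type": "string"}, {"Name": "vendor", "Type": "string"}]
print([col["Name"] for col in merge_glue_columns(existing, incoming)])
```

The merged list could then be assigned to `table_input["StorageDescriptor"]["Columns"]` before calling `update_table`, so that previously registered fields remain queryable.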