
Commit bf35689

v0.0.2-preview
* Lambda changed to work with EventBridge rule
* Template.yaml updated to include Lambda resource-based policy for EventBridge
* README.md updated

Authored-by: Michael Greenshtein <greensht@amazon.com>
1 parent faa1ce6 commit bf35689

7 files changed: +124 −101 lines changed

.gitignore

Lines changed: 2 additions & 1 deletion
@@ -3,4 +3,5 @@ temp
 assets/cloudwatch-dashboard.rendered.json
 samconfig.toml
 .aws-sam
-.env.local.json
+.env.local.json
+events/my.event.json

README.md

Lines changed: 41 additions & 51 deletions
@@ -1,6 +1,6 @@
 ## Monitoring Apache Iceberg Table metadata layer using AWS Lambda, AWS Glue and AWS CloudWatch
 
-This repository provides you with sample code on how to collect metrics of an existing Apache Iceberg table managed in Amazon S3. The code consists of AWS Lambda deployment package that collects and submits metrics into AWS CloudWatch. Repository also includes helper scripts for deploying CloudWatch monitoring dashboard to visualize collected metrics.
+This repository provides sample code that collects metrics from existing Apache Iceberg tables stored in Amazon S3 and catalogued in the AWS Glue Data Catalog. The code consists of an AWS Lambda deployment package that collects and submits metrics to Amazon CloudWatch. The repository also includes a helper script for deploying a CloudWatch monitoring dashboard to visualize the collected metrics.
 
 ### Table of Contents
 - [Technical implementation](#technical-implementation)
@@ -20,9 +20,9 @@ This repository provides you with sample code on how to collect metrics of an ex
 
 ![Architectural diagram of the solution](assets/arch.png)
 
-* AWS Lambda triggered on every Iceberg snapshot creation to collect and send metrics to CloudWatch. This achieved with [S3 event notification](https://docs.aws.amazon.com/AmazonS3/latest/userguide/EventNotifications.html). See [Setting up S3 event notification](#3-setting-up-s3-event-notification) section.
+* An Amazon EventBridge rule triggers AWS Lambda on every *Glue Data Catalog Table State Change* event. This event is emitted every time a transaction is committed to an Apache Iceberg table.
+* The triggered AWS Lambda code aggregates information retrieved from the metadata tables to create [metrics](#metrics-collected) and submits them to Amazon CloudWatch.
 * AWS Lambda code includes `pyiceberg` library and [AWS Glue interactive Sessions](https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions-overview.html) with minimal compute to read `snapshots`, `partitions` and `files` Apache Iceberg metadata tables with Apache Spark.
-* AWS Lambda code aggregates information retrieved from metadata tables to create metrics and submits those to AWS CloudWatch.
 
 
 ### Metrics collected
@@ -58,7 +58,6 @@ This repository provides you with sample code on how to collect metrics of an ex
 * files.max_record_count
 * files.min_record_count
 * files.deviation_record_count
-* files.skew_record_count
 * files.avg_file_size
 * files.max_file_size
 * files.min_file_size
@@ -112,67 +111,59 @@ sam deploy --guided
 ##### Parameters
 
 - `CWNamespace` - A namespace is a container for CloudWatch metrics.
-- `DBName` - Glue Data Catalog Database Name.
-- `TableName` - Apache Iceberg Table name as it appears in the Glue Data Catalog.
 - `GlueServiceRole` - AWS Glue Role arn you created [earlier](#configuring-iam-permissions-for-aws-glue).
 - `Warehouse` - Required catalog property to determine the root path of the data warehouse on S3. This can be any path on your S3 bucket. Not critical for the solution.
-- `IcebergTableS3BucketName` - S3 bucket name is required to allow S3 bucket event notification. SAM will add resource-based permission to allow S3 bucket to invoke AWS Lambda.
 
 
-#### 3. Setting up S3 event notification
+#### 3. Configure EventBridge Trigger
 
-You need to setup an automatic trigger that will activate AWS Lambda metrics collection on every Apache Iceberg commit. This solution is relying on S3 event notification feature to trigger AWS Lambda every time new `metadata.json` is written to S3 `metadata` folder of the table.
-
-You can follow AWS Documentation on how to [enable and configuring event notifications using the Amazon S3 console](https://docs.aws.amazon.com/AmazonS3/latest/userguide/enable-event-notifications.html).
-
-or use the Python Boto3 sample code below. Replace with your bucket name and path to metadata.
+In this section you will configure an EventBridge rule that triggers the Lambda function on every transaction committed to an Apache Iceberg table.
+The default rule listens to the `Glue Data Catalog Table State Change` event from all tables in the Glue Data Catalog; the Lambda code skips non-Iceberg tables.
+If you want to scope the trigger to specific Iceberg tables instead of collecting metrics from all of them, uncomment `glue_table_names = ["<<REPLACE TABLE 1>>", "<<REPLACE TABLE 2>>"]` and add the relevant table names.
 
 ```python
 import boto3
-s3_client = boto3.client('s3')
-lambda_arn = "<REPLACE WITH YOUR ARN>"
-bucket_name = "<REPLACE WITH YOUR S3 BUCKET NAME>"
-path_to_metadata_folder = "<REPLACE WITH YOUR S3 PATH>"
+import json
+
+# Initialize boto3 clients
+lambda_client = boto3.client('lambda')
+events_client = boto3.client('events')
+
+# Parameters
+lambda_function_arn = '<<REPLACE WITH LAMBDA FUNCTION ARN>>'
+glue_table_names = None
+# glue_table_names = ["<<REPLACE TABLE 1>>", "<<REPLACE TABLE 2>>"]
+
+# Create EventBridge Rule
+event_pattern = {
+    "source": ["aws.glue"],
+    "detail-type": ["Glue Data Catalog Table State Change"]
+}
 
-notification_configuration = {
-    'LambdaFunctionConfigurations': [
+if glue_table_names:
+    event_pattern["detail"] = {
+        "tableName": glue_table_names
+    }
+event_pattern_dump = json.dumps(event_pattern)
+rule_response = events_client.put_rule(
+    Name='IcebergTablesUpdateRule',
+    EventPattern=event_pattern_dump,
+    State='ENABLED'
+)
+# Add Lambda as a target to the EventBridge Rule
+events_client.put_targets(
+    Rule='IcebergTablesUpdateRule',
+    Targets=[
         {
-            'LambdaFunctionArn': lambda_arn,
-            'Events': [
-                's3:ObjectCreated:Put'
-            ],
-            'Filter': {
-                'Key': {
-                    'FilterRules': [
-                        {
-                            'Name': 'Prefix',
-                            'Value': path_to_metadata_folder
-                        },
-                        {
-                            'Name': 'Suffix',
-                            'Value': '.json'
-                        }
-                    ]
-                }
-            }
+            'Id': '1',
+            'Arn': lambda_function_arn
         }
     ]
-}
-response = s3_client.put_bucket_notification_configuration(
-    Bucket=bucket_name,
-    NotificationConfiguration=notification_configuration
 )
-if response['ResponseMetadata']['HTTPStatusCode'] == 200:
-    print("Success")
-else:
-    print("Something went wrong")
-
+print(f"Pattern updated = {event_pattern_dump}")
 ```
 
-The final result should look like this
-
-![S3 to AWS Lambda trigger example](assets/trigger.png)
-
 #### 4. (Optional) Create CloudWatch Dashboard
 Once your Iceberg Table metrics are submitted to CloudWatch you can start using them to monitor and create alarms. CloudWatch also let you visualize metrics using CloudWatch Dashboards.
 
@@ -255,7 +246,6 @@ https://docs.docker.com/get-docker/
 
 1. Delete AWS Lambda `sam delete`.
 2. Delete CloudWatch Dashboard.
-3. Remove S3 event notification.
 
 ## Security
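Note that the `IcebergTablesUpdateRule` created by the boto3 snippet in step 3 is not part of the SAM stack, so `sam delete` will not remove it. A minimal cleanup sketch, assuming the rule name and target id used in that snippet:

```python
import boto3

# Remove the EventBridge rule created in step 3; targets must be
# removed before the rule itself can be deleted.
events_client = boto3.client('events')
events_client.remove_targets(Rule='IcebergTablesUpdateRule', Ids=['1'])
events_client.delete_rule(Name='IcebergTablesUpdateRule')
```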

assets/arch.png

13.8 KB

assets/trigger.png

-60.5 KB
Binary file not shown.

events/event.json

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+{
+    "version": "0",
+    "id": "f9155a75-bc7c-d7b8-e240-asd234fdfg",
+    "detail-type": "Glue Data Catalog Table State Change",
+    "source": "aws.glue",
+    "account": "12345678910",
+    "time": "2024-04-29T19:11:21Z",
+    "region": "us-east-1",
+    "resources": [
+        "arn:aws:glue:us-east-1:12345678910:table/sample_db_name/sample_iceberg_table"
+    ],
+    "detail": {
+        "databaseName": "sample_db_name",
+        "changedPartitions": [],
+        "typeOfChange": "UpdateTable",
+        "tableName": "sample_iceberg_table"
+    }
+}
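This sample event can be used to smoke-test the function without waiting for a real Glue Data Catalog change. A minimal sketch, assuming the function is already deployed and `events/event.json` is available locally (the function ARN is a placeholder to be taken from the stack output):

```python
import json
import boto3

# Invoke the deployed Lambda with the sample Glue Data Catalog event.
lambda_client = boto3.client('lambda')

with open('events/event.json') as f:
    sample_event = json.load(f)

response = lambda_client.invoke(
    FunctionName='<<REPLACE WITH LAMBDA FUNCTION ARN>>',
    InvocationType='RequestResponse',
    Payload=json.dumps(sample_event),
)
print(response['StatusCode'])
```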

lambda/app.py

Lines changed: 57 additions & 29 deletions
@@ -14,16 +14,14 @@
 
 glue_client = boto3.client('glue')
 
-required_vars = ['DBNAME', 'TABLENAME', 'CW_NAMESPACE', 'GLUE_SERVICE_ROLE', 'SPARK_CATALOG_S3_WAREHOUSE']
+required_vars = ['CW_NAMESPACE', 'GLUE_SERVICE_ROLE', 'SPARK_CATALOG_S3_WAREHOUSE']
 for var in required_vars:
     # Retrieve the environment variable value
     if os.getenv(var) is None:
         # If any variable is not set, raise an exception
         raise EnvironmentError(f"Required environment variable '{var}' is not set.")
 
 cw_namespace = os.environ.get('CW_NAMESPACE')
-glue_db_name = os.environ.get('DBNAME')
-glue_table_name = os.environ.get('TABLENAME')
 glue_service_role = os.environ.get('GLUE_SERVICE_ROLE')
 warehouse_path = os.environ.get('SPARK_CATALOG_S3_WAREHOUSE')
 
@@ -83,18 +81,26 @@ def wait_for_statement(session_id,statement_id,interval=1):
             logger.info(f"Statement status={status}")
             return response
         time.sleep(interval)
-
-def parse_statement_result(data_str, columns):
-    # Split the string into lines and filter out the irrelevant ones
-    lines = data_str.split('\n')[3:-3]  # Ignore the header and footer lines
-    # Split each line into components and strip the whitespace
-    data = [line.split('|')[1:-1] for line in lines]  # Remove empty strings at start and end
-    data = [[item.strip() for item in row] for row in data]  # Strip whitespace from each item
-    df = pd.DataFrame(data, columns=columns)
-    return df
-
 
-def send_files_metrics(snapshot,session_id):
+
+
+def parse_spark_show_output(output):
+    lines = output.strip().split('\n')
+    header = lines[1]  # Column names are typically in the second line
+    columns = [col.strip() for col in header.split('|') if col.strip()]  # Clean and split by '|'
+
+    data = []
+    # Start reading data from the third line and skip the last line which is a border
+    for row in lines[3:-1]:
+        # Remove border and split
+        row_data = [cell.strip() for cell in row.split('|') if cell.strip()]
+        if row_data:
+            data.append(row_data)
+
+    # Create DataFrame
+    return pd.DataFrame(data, columns=columns)
+
+def send_files_metrics(glue_db_name, glue_table_name, snapshot, session_id):
     sql_stmt = f"select file_path,record_count,file_size_in_bytes from glue_catalog.{glue_db_name}.{glue_table_name}.files"
     run_stmt_response = glue_client.run_statement(
         SessionId=session_id,
@@ -104,15 +110,13 @@ def send_files_metrics(snapshot,session_id):
     logger.info(f"select files statement_id={stmt_id}")
     stmt_response = wait_for_statement(session_id, run_stmt_response["Id"])
     data_str = stmt_response["Statement"]["Output"]["Data"]["TextPlain"]
-    files_metrics_columns = ["file_path","record_count", "file_size_in_bytes"]
-    df = parse_statement_result(data_str,files_metrics_columns)
-
+    logger.info(stmt_response)
+    df = parse_spark_show_output(data_str)
     file_metrics = {
         "avg_record_count": df["record_count"].astype(int).mean().astype(int),
         "max_record_count": df["record_count"].astype(int).max(),
         "min_record_count": df["record_count"].astype(int).min(),
         "deviation_record_count": df['record_count'].astype(int).std().round(2),
-        "skew_record_count": df['record_count'].astype(int).skew().round(2),
         "avg_file_size": df['file_size_in_bytes'].astype(int).mean().astype(int),
         "max_file_size": df['file_size_in_bytes'].astype(int).max(),
         "min_file_size": df['file_size_in_bytes'].astype(int).min(),
@@ -135,7 +139,7 @@ def send_files_metrics(snapshot,session_id):
     )
 
 
-def send_partition_metrics(snapshot,session_id):
+def send_partition_metrics(glue_db_name, glue_table_name, snapshot, session_id):
     sql_stmt = f"select partition,record_count,file_count from glue_catalog.{glue_db_name}.{glue_table_name}.partitions"
     run_stmt_response = glue_client.run_statement(
         SessionId=session_id,
@@ -146,8 +150,12 @@ def send_partition_metrics(snapshot,session_id):
     logger.info(f"send_partition_metrics() -> statement_id={stmt_id}")
     stmt_response = wait_for_statement(session_id, stmt_id)
     data_str = stmt_response["Statement"]["Output"]["Data"]["TextPlain"]
-    partition_metrics_columns = ['partition', 'record_count', 'file_count']
-    df = parse_statement_result(data_str,partition_metrics_columns)
+
+    if data_str == "":
+        logger.info("No partitions found")
+        return
+
+    df = parse_spark_show_output(data_str)
     partition_metrics = {
         "avg_record_count": df["record_count"].astype(int).mean().astype(int),
         "max_record_count": df["record_count"].astype(int).max(),
@@ -249,9 +257,11 @@ def dt_to_ts(dt_str):
     timestamp_seconds = dt_obj.timestamp()
     return int(timestamp_seconds * 1000)
 
-def send_snapshot_metrics(snapshot_id, session_id):
+
+def send_snapshot_metrics(glue_db_name, glue_table_name, snapshot_id, session_id):
     logger.info("send_snapshot_metrics")
     sql_stmt = f"select committed_at,snapshot_id,operation,summary from glue_catalog.{glue_db_name}.{glue_table_name}.snapshots where snapshot_id={snapshot_id}"
+    logger.debug(sql_stmt)
     run_stmt_response = glue_client.run_statement(
         SessionId=session_id,
         Code=f"df=spark.sql(\"{sql_stmt}\");json_rdd=df.toJSON();json_strings=json_rdd.collect();print(json_strings)"
@@ -288,20 +298,38 @@ def send_snapshot_metrics(snapshot_id, session_id):
         timestamp = timestamp_ms,
     )
 
+# check if glue table is of iceberg format, return boolean
+def check_table_is_of_iceberg_format(event):
+    response = glue_client.get_table(
+        DatabaseName=event["detail"]["databaseName"],
+        Name=event["detail"]["tableName"],
+    )
+    try:
+        return response["Table"]["Parameters"]["table_type"] == "ICEBERG"
+    except KeyError:
+        logger.warning("check_table_is_of_iceberg_format() -> table_type is missing")
+        return False
+
 
 def lambda_handler(event, context):
-    log_format = f"[{context.aws_request_id}:%(asctime)s.%(msecs)03d] %(message)s"
-    logging.basicConfig(format=log_format, datefmt='%Y-%m-%d %H:%M:%S', level=logging.INFO)
+    log_format = f"[{context.aws_request_id}] %(message)s"
+    logging.basicConfig(format=log_format, level=logging.INFO)
 
+    # Ensure Table is of Iceberg format.
+    if not check_table_is_of_iceberg_format(event):
+        logger.info("Table is not of Iceberg format, skipping metrics generation")
+        return
 
-    catalog = GlueCatalog("default")
+    glue_db_name = event["detail"]["databaseName"]
+    glue_table_name = event["detail"]["tableName"]
+
+    catalog = GlueCatalog(glue_db_name)
     table = catalog.load_table((glue_db_name, glue_table_name))
     logger.info(f"current snapshot id={table.metadata.current_snapshot_id}")
     snapshot = table.metadata.snapshot_by_id(table.metadata.current_snapshot_id)
-
     logger.info("Using glue IS to produce metrics")
     session_id = create_or_reuse_glue_session()
 
-    send_snapshot_metrics(table.metadata.current_snapshot_id, session_id)
-    send_partition_metrics(snapshot,session_id)
-    send_files_metrics(snapshot,session_id)
+    send_snapshot_metrics(glue_db_name, glue_table_name, table.metadata.current_snapshot_id, session_id)
+    send_partition_metrics(glue_db_name, glue_table_name, snapshot, session_id)
+    send_files_metrics(glue_db_name, glue_table_name, snapshot, session_id)
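For reference, the new `parse_spark_show_output` expects the plain-text table that Spark's `show()` prints: column names on the second line and data rows between the third line and the closing border. A minimal sketch with made-up sample output (the parsing logic mirrors the function above):

```python
import pandas as pd

def parse_spark_show_output(output):
    # Same parsing logic as in lambda/app.py.
    lines = output.strip().split('\n')
    header = lines[1]
    columns = [col.strip() for col in header.split('|') if col.strip()]
    data = []
    for row in lines[3:-1]:
        row_data = [cell.strip() for cell in row.split('|') if cell.strip()]
        if row_data:
            data.append(row_data)
    return pd.DataFrame(data, columns=columns)

sample = """+--------------------+------------+------------------+
|           file_path|record_count|file_size_in_bytes|
+--------------------+------------+------------------+
|s3://bucket/data/f1 |        1200|            524288|
|s3://bucket/data/f2 |         800|            262144|
+--------------------+------------+------------------+"""

df = parse_spark_show_output(sample)
print(df["record_count"].astype(int).mean())  # 1000.0
```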

template.yaml

Lines changed: 6 additions & 20 deletions
@@ -4,21 +4,12 @@ Parameters:
   CWNamespace:
     Description: Amazon CloudWatch custom metric namespace
     Type: String
-  DBName:
-    Description: AWS Glue Database Name
-    Type: String
-  TableName:
-    Description: AWS Glue Table Name
-    Type: String
   GlueServiceRole:
     Description: AWS Glue Service Role with permissions to create session.
     Type: String
   Warehouse:
     Description: warehouse is a required catalog property to determine the root path of the data warehouse in storage.
     Type: String
-  IcebergTableS3BucketName:
-    Description: this property is required to allow S3 bucket event notification. SAM will add resource-based permission to allow S3 bucket to invoke AWS Lambda.
-    Type: String
 Globals:
   Function:
     Timeout: 360
@@ -32,9 +23,8 @@ Resources:
       Runtime: python3.9
       Policies:
         - CloudWatchPutMetricPolicy: {}
-        - S3ReadPolicy:
-            BucketName: !Ref IcebergTableS3BucketName
         - AWSLambdaBasicExecutionRole
+        - AmazonS3ReadOnlyAccess
         - Statement:
           - Sid: GlueInteractiveSessionPolicy
             Effect: Allow
@@ -68,20 +58,16 @@ Resources:
         - x86_64
       Environment:
         Variables:
-          DBNAME: !Ref DBName
-          TABLENAME: !Ref TableName
          CW_NAMESPACE: !Ref CWNamespace
          GLUE_SERVICE_ROLE: !Ref GlueServiceRole
          SPARK_CATALOG_S3_WAREHOUSE: !Ref Warehouse
-  S3LambdaInvokeResourceBasedPermission:
+  PermissionForEventBridgeToInvokeLambda:
     Type: AWS::Lambda::Permission
     Properties:
-      FunctionName: !GetAtt IcebergMetricsLambda.Arn
-      Action: lambda:InvokeFunction
-      Principal: s3.amazonaws.com
-      SourceAccount: !Ref 'AWS::AccountId'
-      SourceArn: !Sub 'arn:aws:s3:::${IcebergTableS3BucketName}'
-
+      FunctionName: !Ref IcebergMetricsLambda
+      Action: 'lambda:InvokeFunction'
+      Principal: 'events.amazonaws.com'
+      SourceArn: !Sub 'arn:aws:events:${AWS::Region}:${AWS::AccountId}:rule/IcebergTablesUpdateRule*'
 Outputs:
   IcebergMetricsLambda:
     Description: Lambda Function ARN
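Once the stack is deployed, the resource-based policy added by `PermissionForEventBridgeToInvokeLambda` can be inspected on the function. A minimal sketch, assuming the deployed function name is taken from the stack output:

```python
import json
import boto3

# Print the statements of the Lambda resource-based policy to confirm
# that events.amazonaws.com is allowed to invoke the function.
lambda_client = boto3.client('lambda')

policy = lambda_client.get_policy(FunctionName='<<REPLACE WITH LAMBDA FUNCTION NAME>>')
for statement in json.loads(policy['Policy'])['Statement']:
    print(statement['Sid'], statement['Principal'], statement.get('Condition', {}))
```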
