* Lambda changed to work with EventBridge rule
* Template.yaml updated to include Lambda resource-based policy for eventBridge
* Readme.md updated
---------
Authored-by: Michael Greenshtein <greensht@amazon.com>
README.md (+41 -51): 41 additions & 51 deletions
@@ -1,6 +1,6 @@
## Monitoring Apache Iceberg Table metadata layer using AWS Lambda, AWS Glue and AWS CloudWatch

-This repository provides you with sample code on how to collect metrics of an existing Apache Iceberg table managed in Amazon S3. The code consists of AWS Lambda deployment package that collects and submits metrics into AWS CloudWatch. Repository also includes helper scripts for deploying CloudWatch monitoring dashboard to visualize collected metrics.
+This repository provides sample code that collects metrics from existing Apache Iceberg tables stored in Amazon S3 and catalogued in the AWS Glue Data Catalog. The code consists of an AWS Lambda deployment package that collects metrics and submits them to AWS CloudWatch. The repository also includes a helper script for deploying a CloudWatch monitoring dashboard to visualize the collected metrics.
@@ -20,9 +20,9 @@ This repository provides you with sample code on how to collect metrics of an ex

-* AWS Lambda triggered on every Iceberg snapshot creation to collect and send metrics to CloudWatch. This achieved with [S3 event notification](https://docs.aws.amazon.com/AmazonS3/latest/userguide/EventNotifications.html). See [Setting up S3 event notification](#3-setting-up-s3-event-notification) section.
+* An Amazon EventBridge rule triggers AWS Lambda on every *Glue Data Catalog Table State Change* event. This event is emitted every time a transaction is committed to an Apache Iceberg table.
+* The triggered AWS Lambda code aggregates information retrieved from the metadata tables to create [metrics](#metrics-collected) and submits them to Amazon CloudWatch.
* AWS Lambda code includes `pyiceberg` library and [AWS Glue interactive Sessions](https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions-overview.html) with minimal compute to read `snapshots`, `partitions` and `files` Apache Iceberg metadata tables with Apache Spark.
-* AWS Lambda code aggregates information retrieved from metadata tables to create metrics and submits those to AWS CloudWatch.

### Metrics collected
@@ -58,7 +58,6 @@ This repository provides you with sample code on how to collect metrics of an ex
* files.max_record_count
* files.min_record_count
* files.deviation_record_count
-* files.skew_record_count
* files.avg_file_size
* files.max_file_size
* files.min_file_size
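
The record-count metrics listed above can be illustrated with a small, self-contained sketch. It assumes that `deviation` means the population standard deviation, which this excerpt does not confirm. The function names `summarize_record_counts` and `to_metric_data` and the `table_name` dimension are hypothetical and not taken from the repository. The output follows the shape of the `MetricData` entries accepted by CloudWatch `put_metric_data`.

```python
import statistics


def summarize_record_counts(record_counts):
    """Aggregate per-file record counts into max/min/deviation values."""
    if not record_counts:
        raise ValueError("record_counts must not be empty")
    return {
        "files.max_record_count": max(record_counts),
        "files.min_record_count": min(record_counts),
        # Assumption: "deviation" is the population standard deviation.
        "files.deviation_record_count": statistics.pstdev(record_counts),
    }


def to_metric_data(summary, table_name):
    """Convert a summary dict into CloudWatch MetricData entries."""
    return [
        {
            "MetricName": name,
            # Hypothetical dimension name; the repository may use another.
            "Dimensions": [{"Name": "table_name", "Value": table_name}],
            "Value": float(value),
        }
        for name, value in summary.items()
    ]


# Example: three data files holding 10, 20 and 30 records.
summary = summarize_record_counts([10, 20, 30])
metric_data = to_metric_data(summary, "orders")
```

A list built this way could be passed as `MetricData` to the CloudWatch client's `put_metric_data` call, together with the namespace configured through `CWNamespace`.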
@@ -112,67 +111,59 @@ sam deploy --guided
##### Parameters

-`CWNamespace` - A namespace is a container for CloudWatch metrics.
--`DBName` - Glue Data Catalog Database Name.
--`TableName` - Apache Iceberg Table name as it appears in the Glue Data Catalog.
-`GlueServiceRole` - AWS Glue Role arn you created [earlier](#configuring-iam-permissions-for-aws-glue).
-`Warehouse` - Required catalog property to determine the root path of the data warehouse on S3. This can be any path on your S3 bucket. Not critical for the solution.
--`IcebergTableS3BucketName` - S3 bucket name is required to allow S3 bucket event notification. SAM will add resource-based permission to allow S3 bucket to invoke AWS Lambda.


-#### 3. Setting up S3 event notification
+#### 3. Configure EventBridge Trigger

-You need to setup an automatic trigger that will activate AWS Lambda metrics collection on every Apache Iceberg commit. This solution is relying on S3 event notification feature to trigger AWS Lambda every time new `metadata.json` is written to S3 `metadata` folder of the table.
-
-You can follow AWS Documentation on how to [enable and configuring event notifications using the Amazon S3 console](https://docs.aws.amazon.com/AmazonS3/latest/userguide/enable-event-notifications.html).
-
-or use the Python Boto3 sample code below. Replace with your bucket name and path to metadata.
+In this section you will configure an EventBridge rule that triggers the Lambda function on every transaction commit to an Apache Iceberg table.
+The default rule listens to the `Glue Data Catalog Table State Change` event from all tables in the Glue Data Catalog. The Lambda code skips non-Iceberg tables.
+To scope the trigger to specific Iceberg tables instead of collecting metrics from all of them, uncomment `glue_table_names = ["<<REPLACE TABLE 1>>", "<<REPLACE TABLE 1>>"]` and add the relevant table names.
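
How the Lambda code recognizes non-Iceberg tables is not shown in this diff. One plausible way to do it, given here only as a sketch, relies on the `table_type` parameter that Iceberg sets to `ICEBERG` on tables registered in the Glue Data Catalog. The helper name `is_iceberg_table` is hypothetical; its input is the `Table` dictionary returned by the Glue `get_table` call.

```python
def is_iceberg_table(glue_table):
    """Return True when a Glue 'Table' dict describes an Iceberg table.

    Assumption: Iceberg tables carry Parameters["table_type"] == "ICEBERG".
    """
    parameters = glue_table.get("Parameters") or {}
    return str(parameters.get("table_type", "")).upper() == "ICEBERG"


# Example inputs shaped like the 'Table' part of a Glue get_table response.
iceberg_table = {"Name": "orders", "Parameters": {"table_type": "ICEBERG"}}
hive_table = {"Name": "raw_logs", "Parameters": {"classification": "parquet"}}
```

A handler using such a check could return early for tables like `hive_table` and collect metrics only for tables like `iceberg_table`.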

```python
import boto3
-s3_client = boto3.client('s3')
-lambda_arn = "<REPLACE WITH YOUR ARN>"
-bucket_name = "<REPLACE WITH YOUR S3 BUCKET NAME>"
-path_to_metadata_folder = "<REPLACE WITH YOUR S3 PATH>"
+import json
+
+# Initialize a boto3 client
+lambda_client = boto3.client('lambda')
+events_client = boto3.client('events')
+
+# Parameters
+lambda_function_arn = '<<REPLACE WITH LAMBDA FUNCTION ARN>>'
if response['ResponseMetadata']['HTTPStatusCode'] == 200:
-    print("Success")
-else:
-    print("Something went wrong")
-
+print(f"Pattern updated = {event_pattern_dump}")
```
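
The scoping option described in this section comes down to the shape of the EventBridge event pattern. The sketch below is not the repository's script, whose middle part is not visible in this diff. It only shows, under stated assumptions, how a pattern for the `Glue Data Catalog Table State Change` event could be built with and without a table filter. The helper name `build_event_pattern` and the example table names are hypothetical, and the sketch assumes the event carries the table name in `detail.tableName`.

```python
import json


def build_event_pattern(glue_table_names=None):
    """Build an EventBridge pattern for Glue table state change events.

    With no table names, the pattern matches every table in the catalog.
    """
    pattern = {
        "source": ["aws.glue"],
        "detail-type": ["Glue Data Catalog Table State Change"],
    }
    if glue_table_names:
        # Assumption: the event exposes the table name as detail.tableName.
        pattern["detail"] = {"tableName": list(glue_table_names)}
    return pattern


# Pattern for all tables, and a pattern scoped to two example tables.
default_pattern = build_event_pattern()
scoped_pattern = build_event_pattern(["orders", "customers"])
event_pattern_dump = json.dumps(scoped_pattern)
print(f"Pattern = {event_pattern_dump}")
```

The serialized string is the form that the EventBridge `put_rule` call expects in its `EventPattern` argument.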

-The final result should look like this
-
-
-
#### 4. (Optional) Create CloudWatch Dashboard
Once your Iceberg table metrics are submitted to CloudWatch, you can start using them to monitor and create alarms. CloudWatch also lets you visualize metrics using CloudWatch Dashboards.
sql_stmt=f"select committed_at,snapshot_id,operation,summary from glue_catalog.{glue_db_name}.{glue_table_name}.snapshots where snapshot_id={snapshot_id}"
    Description: AWS Glue Service Role with permissions to create session.
    Type: String
  Warehouse:
    Description: warehouse is a required catalog property to determine the root path of the data warehouse in storage.
    Type: String
-  IcebergTableS3BucketName:
-    Description: this property is required to allow S3 bucket event notification. SAM will add resource-based permission to allow S3 bucket to invoke AWS Lambda.