Skip to content

Conversation

@itakserman-cloudinary
Copy link

@itakserman-cloudinary itakserman-cloudinary commented Aug 7, 2024

Description of changes:
Deprecation usage of glue session by transitioning Iceberg tables metrics retrieval to pyiceberg

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Copy link

@amitgilad3 amitgilad3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great work, a few questions

lambda/app.py Outdated
catalog = GlueCatalog(glue_db_name)
table = catalog.load_table((glue_db_name, glue_table_name))
logger.info(f"current snapshot id={table.metadata.current_snapshot_id}")
snapshot = table.metadata.snapshot_by_id(table.metadata.current_snapshot_id)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can use table.current_snapshot() this will bring the current snapshot instance, why do you need the id if you can just get the current one??

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, I'm using the suggested version in other places. Will update 👍

)
def send_files_metrics(table: Table, snapshot: Snapshot):
logger.info(f"send_files_metrics() -> snapshot_id={snapshot.snapshot_id}")
df = table.inspect.files().to_pandas()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

take into account that this method can be very expensive , for example a table with thousands of files this can be long , does this metric help with understanding anything ??

in the end if the table is partitioned what matters is the state of partitions and not avg files across entire table

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially this metric was exposed. I guess it's a question for the actual template users whether to use it or not...
@moryachok - wdyt ?

Copy link
Contributor

@moryachok moryachok Aug 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same could be true for partitions as well - with thousands of partitions this could be expensive. We can do send_files_metrics and send_partitions_metrics optional

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i agree that their can be thousands of partitions but for this the best case would be 1 file per partition which would make it n files but im sure that would not be the case and in reality it much more than 1 file per partition.
their will always be more files than partitions

- CloudWatchPutMetricPolicy: {}
- AWSLambdaBasicExecutionRole
- AmazonS3ReadOnlyAccess
- ECR-Pull-Read-Only
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @itakserman-cloudinary ,
I am getting CloudFormation error when trying to deploy. It looks like you're using ECR-Pull-Read-Only policy which isn't defined in a SAM. There are pre-configured SAM policies that we can reuse, if the one you suggesting isn't there you will need to add relevant permissions as separate statement in the template.yaml.

See the list of SAM policies here:
https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-policy-templates.html

### Build and Deploy

> ! Important - The guidance below uses AWS Serverless Application Model (SAM) for easier packaging and deployment of AWS Lambda. However if you use your own packaging tool or if you want to deploy AWS Lambda manually you can explore following files:
> ! Important - The guidance below uses AWS Serverless Application Model (SAM) and Amazon ECR for easier packaging and deployment of AWS Lambda. However if you use your own packaging tool or if you want to deploy AWS Lambda manually you can explore following files:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using ECR for lambda packaging block looks good, but I would make it more easier for developers by suggesting bash env vars before the commands, like so:

export CLOUDWATCH_NAMESPACE={{ cw_namespace }}
export AWS_REGION={{ aws_region }}
export aws_account_id={{ aws_account_id }}
export ecr_repository_name={{ repository_name }}
export STACK_NAME={{ your stack name }}
export S3_ARTIFACTS_BUCKET_NAME={{ s3_bucket_name }}
export S3_ARTIFACTS_PATH={{ s3_bucket_path }}
export ecr_repository_uri=${aws_account_id}.dkr.ecr.$AWS_REGION.amazonaws.com/${ecr_repository_name}

Once defined those let them just run the code

docker build -f Dockerfile --platform linux/amd64 -t ${ecr_repository_name}:main --build-arg CLOUDWATCH_NAMESPACE=$CLOUDWATCH_NAMESPACE .
sam build --use-container
aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin ${aws_account_id}.dkr.ecr.us-east-1.amazonaws.com
aws ecr create-repository --repository-name $ecr_repository_name --region $AWS_REGION --image-scanning-configuration scanOnPush=true --image-tag-mutability MUTABLE
docker tag ${ecr_repository_name}:main ${ecr_repository_uri}:latest
docker push ${ecr_repository_uri}:latest

sam deploy --debug --region $AWS_REGION \
        --parameter-overrides ImageURL=${ecr_repository_uri}:latest \
        --image-repository $ecr_repository_uri \
        --stack-name $STACK_NAME --capabilities CAPABILITY_IAM CAPABILITY_AUTO_EXPAND \
        --s3-bucket $S3_ARTIFACTS_BUCKET_NAME --s3-prefix $S3_ARTIFACTS_PATH

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@moryachok - done I think :)

README.md Outdated
aws ecr create-repository --repository-name iceberg-monitoring --region {{ aws_region }} --image-scanning-configuration scanOnPush=true --image-tag-mutability MUTABLE
docker tag iceberg-monitoring:main {{ ecr_repository_uri }}:latest
docker push {{ aws_account_id }}.dkr.ecr.{{ aws_region }}.amazonaws.com/iceberg-monitoring:latest
sam deploy --debug --region {{ aws_region }} \
Copy link
Contributor

@moryachok moryachok Aug 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that for the very first deploy you have to add --guided attribute

lambda/app.py Outdated
snapshot = table.metadata.snapshot_by_id(table.metadata.current_snapshot_id)
snapshot = table.current_snapshot()
logger.info(f"current snapshot id={snapshot.snapshot_id}")
logger.info("Using glue IS to produce metrics")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's remove this log

@amirgoldman1
Copy link

@moryachok @itakserman-cloudinary Are you planning on merging this PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants