Transition to pyiceberg #5
Conversation
Great work, a few questions.
lambda/app.py
Outdated
```python
catalog = GlueCatalog(glue_db_name)
table = catalog.load_table((glue_db_name, glue_table_name))
logger.info(f"current snapshot id={table.metadata.current_snapshot_id}")
snapshot = table.metadata.snapshot_by_id(table.metadata.current_snapshot_id)
```
You can use `table.current_snapshot()`; it returns the current snapshot instance directly. Why do you need the id if you can just get the current one?
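For reference, a minimal sketch of the suggested simplification (names follow the diff above; note that `current_snapshot()` can return `None` for a table with no snapshots yet):

```python
from pyiceberg.catalog.glue import GlueCatalog

# glue_db_name, glue_table_name and logger follow the diff above.
catalog = GlueCatalog(glue_db_name)
table = catalog.load_table((glue_db_name, glue_table_name))

# current_snapshot() returns the Snapshot instance directly,
# so there is no need to look it up by id.
snapshot = table.current_snapshot()
if snapshot is not None:  # a freshly created table may have no snapshot yet
    logger.info(f"current snapshot id={snapshot.snapshot_id}")
```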
Correct, I'm using the suggested version in other places. Will update 👍
```python
)
def send_files_metrics(table: Table, snapshot: Snapshot):
    logger.info(f"send_files_metrics() -> snapshot_id={snapshot.snapshot_id}")
    df = table.inspect.files().to_pandas()
```
Take into account that this method can be very expensive; for a table with thousands of files it can take a long time. Does this metric actually help with understanding anything?
In the end, if the table is partitioned, what matters is the state of the partitions, not the average number of files across the entire table.
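If partition-level state is what matters, a hedged sketch using pyiceberg's inspect API could look like this (assuming `table.inspect.partitions()` is available in the pinned pyiceberg version; the exact column names vary between releases):

```python
# One row per partition instead of one row per data file.
partitions_df = table.inspect.partitions().to_pandas()

# Emit partition-level metrics, e.g. files per partition, rather than a
# table-wide average. 'file_count' is assumed here; check the columns
# exposed by the installed pyiceberg release.
logger.info(
    f"partitions={len(partitions_df)}, "
    f"files={partitions_df['file_count'].sum()}"
)
```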
Initially this metric was exposed. I guess it's a question for the actual template users whether to use it or not...
@moryachok - wdyt?
The same could be true for partitions as well - with thousands of partitions this could be expensive. We can make send_files_metrics and send_partitions_metrics optional.
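One way to make both calls opt-in (a sketch; the ENABLE_* environment variables are hypothetical, not something already in the template):

```python
import os

# Hypothetical feature flags; the names are illustrative only.
if os.environ.get("ENABLE_FILES_METRICS", "true").lower() == "true":
    send_files_metrics(table, snapshot)
if os.environ.get("ENABLE_PARTITIONS_METRICS", "true").lower() == "true":
    send_partitions_metrics(table, snapshot)
```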
I agree that there can be thousands of partitions, but even then the best case would be one file per partition, which already makes it n files - and I'm sure that would not be the case; in reality it's much more than one file per partition.
There will always be more files than partitions.
lambda/template.yaml
Outdated
```yaml
- CloudWatchPutMetricPolicy: {}
- AWSLambdaBasicExecutionRole
- AmazonS3ReadOnlyAccess
- ECR-Pull-Read-Only
```
Hey @itakserman-cloudinary,
I am getting a CloudFormation error when trying to deploy. It looks like you're using the ECR-Pull-Read-Only policy, which isn't defined in SAM. There are pre-configured SAM policy templates that we can reuse; if the one you're suggesting isn't there, you will need to add the relevant permissions as a separate statement in template.yaml.
See the list of SAM policy templates here:
https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-policy-templates.html
```diff
 ### Build and Deploy

-> ! Important - The guidance below uses AWS Serverless Application Model (SAM) for easier packaging and deployment of AWS Lambda. However if you use your own packaging tool or if you want to deploy AWS Lambda manually you can explore following files:
+> ! Important - The guidance below uses AWS Serverless Application Model (SAM) and Amazon ECR for easier packaging and deployment of AWS Lambda. However if you use your own packaging tool or if you want to deploy AWS Lambda manually you can explore following files:
```
The "Using ECR for lambda packaging" block looks good, but I would make it easier for developers by suggesting bash env vars before the commands, like so:

```bash
export CLOUDWATCH_NAMESPACE={{ cw_namespace }}
export AWS_REGION={{ aws_region }}
export aws_account_id={{ aws_account_id }}
export ecr_repository_name={{ repository_name }}
export STACK_NAME={{ your stack name }}
export S3_ARTIFACTS_BUCKET_NAME={{ s3_bucket_name }}
export S3_ARTIFACTS_PATH={{ s3_bucket_path }}
export ecr_repository_uri=${aws_account_id}.dkr.ecr.${AWS_REGION}.amazonaws.com/${ecr_repository_name}
```

Once those are defined, let them just run the code:

```bash
docker build -f Dockerfile --platform linux/amd64 -t ${ecr_repository_name}:main --build-arg CLOUDWATCH_NAMESPACE=$CLOUDWATCH_NAMESPACE .
sam build --use-container
aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin ${aws_account_id}.dkr.ecr.${AWS_REGION}.amazonaws.com
aws ecr create-repository --repository-name $ecr_repository_name --region $AWS_REGION --image-scanning-configuration scanOnPush=true --image-tag-mutability MUTABLE
docker tag ${ecr_repository_name}:main ${ecr_repository_uri}:latest
docker push ${ecr_repository_uri}:latest
sam deploy --debug --region $AWS_REGION \
  --parameter-overrides ImageURL=${ecr_repository_uri}:latest \
  --image-repository $ecr_repository_uri \
  --stack-name $STACK_NAME --capabilities CAPABILITY_IAM CAPABILITY_AUTO_EXPAND \
  --s3-bucket $S3_ARTIFACTS_BUCKET_NAME --s3-prefix $S3_ARTIFACTS_PATH
```
@moryachok - done I think :)
README.md
Outdated
```bash
aws ecr create-repository --repository-name iceberg-monitoring --region {{ aws_region }} --image-scanning-configuration scanOnPush=true --image-tag-mutability MUTABLE
docker tag iceberg-monitoring:main {{ ecr_repository_uri }}:latest
docker push {{ aws_account_id }}.dkr.ecr.{{ aws_region }}.amazonaws.com/iceberg-monitoring:latest
sam deploy --debug --region {{ aws_region }} \
```
I think that for the very first deploy you have to add the `--guided` flag, e.g. `sam deploy --guided`.
lambda/app.py
Outdated
```diff
-snapshot = table.metadata.snapshot_by_id(table.metadata.current_snapshot_id)
+snapshot = table.current_snapshot()
 logger.info(f"current snapshot id={snapshot.snapshot_id}")
 logger.info("Using glue IS to produce metrics")
```
Let's remove this log.
@moryachok @itakserman-cloudinary Are you planning on merging this PR?
Description of changes:
Deprecates usage of the Glue session by transitioning Iceberg table metrics retrieval to pyiceberg.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.