Create tool to monitor S3 prefixes #177

@Mr0grog

Folks at PEDP would like to keep an eye on the status of a bunch of datasets that are a collection of files in an S3 bucket with a common prefix. For example:

  • s3://nrel-pds-wtk/bchrrr/v1.0.0
  • s3://nrel-pds-wtk/canada/v1.0.0

You can’t treat this like an FTP directory and just browse to the corresponding URL (e.g. https://nrel-pds-wtk.s3.amazonaws.com/bchrrr/v1.0.0/) to get a list of files. (Depending on bucket configuration, you can get a listing at the root of the bucket, but nowhere else.) Instead, we need to develop something more specialized to keep an eye on these files.

The AWS CLI can already do a lot of this for us, so we may just need some scripting around that:

> aws s3 ls --summarize --recursive s3://nrel-pds-wtk/bchrrr/v1.0.0/
2024-11-19 12:43:09 1779388467382 bchrrr/v1.0.0/bchrrr_conus_2015.h5
2024-11-19 12:55:32 1784947575862 bchrrr/v1.0.0/bchrrr_conus_2016.h5
2024-11-19 12:55:32 1779388467382 bchrrr/v1.0.0/bchrrr_conus_2017.h5
2024-11-19 12:55:32 1779388467382 bchrrr/v1.0.0/bchrrr_conus_2018.h5
2024-11-19 12:55:32 1779388463286 bchrrr/v1.0.0/bchrrr_conus_2019.h5
2024-11-19 18:56:10 1784947571766 bchrrr/v1.0.0/bchrrr_conus_2020.h5
2024-11-19 19:52:51 1779388463286 bchrrr/v1.0.0/bchrrr_conus_2021.h5
2024-11-19 19:53:54 1779388463286 bchrrr/v1.0.0/bchrrr_conus_2022.h5
2024-11-19 19:54:02 1779388463286 bchrrr/v1.0.0/bchrrr_conus_2023.h5

Total Objects: 9
   Total Size: 16025614402918
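
If we go the "scripting around the CLI" route, the output above is easy to turn into structured records. This is only a rough sketch: the names `S3Entry`, `parse_s3_ls`, and `run_s3_ls` are made up for illustration, and it assumes the CLI output keeps the date / time / size / key layout shown above.

```python
import subprocess
from datetime import datetime
from typing import NamedTuple


class S3Entry(NamedTuple):
    """One object line from `aws s3 ls --recursive` (hypothetical helper type)."""
    modified: datetime  # naive datetime, exactly as printed by the CLI
    size: int           # size in bytes
    key: str            # full object key, including the prefix


def parse_s3_ls(output: str) -> list[S3Entry]:
    """Parse lines like '2024-11-19 12:43:09 1779388467382 path/to/key'."""
    entries = []
    for line in output.splitlines():
        # Split into date, time, size, and the key (keys may contain spaces).
        parts = line.split(None, 3)
        if len(parts) != 4:
            continue  # blank lines and the "Total ..." summary lines
        date, time, size, key = parts
        try:
            modified = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
            entries.append(S3Entry(modified, int(size), key))
        except ValueError:
            continue  # anything else that does not look like an object line
    return entries


def run_s3_ls(url: str) -> list[S3Entry]:
    """Run the same command as above and parse its output (needs the AWS CLI)."""
    result = subprocess.run(
        ["aws", "s3", "ls", "--summarize", "--recursive", url],
        capture_output=True, text=True, check=True,
    )
    return parse_s3_ls(result.stdout)
```

One caveat with this approach: a key that starts with whitespace would lose it in the split, so a client library is probably safer if odd key names are a concern.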

Alternatively, you could use a client library and wrap it in some custom code to produce fancier output (e.g. in Python with boto3: list_objects_v2 or S3.Paginator.ListObjectsV2).
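
For the client-library route, here is a minimal sketch using the boto3 paginator mentioned above. The function names `list_prefix` and `make_anonymous_client` and the record layout are my own invention, and the unsigned client assumes the bucket allows anonymous listing, which I have not verified for every bucket we care about.

```python
def list_prefix(client, bucket: str, prefix: str) -> list[dict]:
    """Return one record per object under `prefix`, sorted by key."""
    paginator = client.get_paginator("list_objects_v2")
    records = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when nothing matches the prefix.
        for obj in page.get("Contents", []):
            records.append({
                "key": obj["Key"],
                "size": obj["Size"],
                "modified": obj["LastModified"].isoformat(),
            })
    return sorted(records, key=lambda record: record["key"])


def make_anonymous_client():
    """Build an unsigned S3 client (assumption: the bucket is publicly listable)."""
    # Imported here so the listing logic above can be exercised without boto3.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    return boto3.client("s3", config=Config(signature_version=UNSIGNED))
```

Usage would look something like `list_prefix(make_anonymous_client(), "nrel-pds-wtk", "bchrrr/v1.0.0/")`. An empty result means nothing currently exists under the prefix.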

At its simplest, I think we just need something to:

  • Run on a set schedule (e.g. 1/day).
  • Iterate over each S3 prefix we want to monitor.
  • List all the files, sizes, and modified-times with the given prefix.
  • Handle situations where the prefix disappears entirely.
  • Save the results somewhere so they can be compared, run to run.
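
To make the "save and compare" steps above concrete, here is one possible shape for them. Everything here is an assumption for illustration: the snapshot format (a JSON list of `{"key", "size", "modified"}` records), the function names, and the idea of treating "previously had files, now has none" as the prefix having disappeared.

```python
import json
from pathlib import Path


def load_snapshot(path: Path) -> list[dict]:
    """Read the previous run's records, or an empty list on the first run."""
    if not path.exists():
        return []
    return json.loads(path.read_text(encoding="utf-8"))


def save_snapshot(path: Path, records: list[dict]) -> None:
    """Write records as stable, indented JSON so text diffs stay readable."""
    ordered = sorted(records, key=lambda record: record["key"])
    path.write_text(json.dumps(ordered, indent=2, sort_keys=True) + "\n",
                    encoding="utf-8")


def diff_snapshots(old: list[dict], new: list[dict]) -> dict:
    """Compare two listings of the same prefix by key, size, and modified time."""
    old_by_key = {record["key"]: record for record in old}
    new_by_key = {record["key"]: record for record in new}
    shared = old_by_key.keys() & new_by_key.keys()
    return {
        "added": sorted(new_by_key.keys() - old_by_key.keys()),
        "removed": sorted(old_by_key.keys() - new_by_key.keys()),
        "changed": sorted(k for k in shared if old_by_key[k] != new_by_key[k]),
        # The prefix "disappeared" if it used to hold files and now holds none.
        "prefix_missing": bool(old) and not new,
    }
```

Writing one JSON file per prefix with sorted keys means that committing the files to a repo gives a usable diff between runs without any extra tooling.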

The plainest version of this might just be a GitHub action that runs and saves the results to a GH repo, where someone can see diffs between different runs.

Ideally, it would also send people an e-mail or Slack message if files are deleted.
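
For the Slack side, a sketch of what the alert could look like, assuming we set up a Slack incoming webhook (the webhook URL would be a secret we configure; none exists yet). The function names and message wording are placeholders, not a settled design.

```python
import json
import urllib.request


def build_deletion_message(prefix: str, removed_keys: list[str], limit: int = 20) -> str:
    """Format a short plain-text alert listing deleted keys (capped at `limit`)."""
    lines = [f"{len(removed_keys)} file(s) deleted under {prefix}:"]
    lines += [f"- {key}" for key in removed_keys[:limit]]
    if len(removed_keys) > limit:
        lines.append(f"... and {len(removed_keys) - limit} more")
    return "\n".join(lines)


def post_to_slack(webhook_url: str, text: str) -> int:
    """POST a message to a Slack incoming webhook and return the HTTP status."""
    body = json.dumps({"text": text}).encode("utf-8")
    request = urllib.request.Request(
        webhook_url,  # hypothetical: supplied via configuration or a secret
        data=body,    # supplying data makes this a POST request
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.status
```

The message would only be sent when the list of removed keys from a run is non-empty, so quiet days produce no noise.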

Fancier stuff that would probably be useful:

  • Produce a WARC of some sort so other archiving tools can consume it in a semi-standardized way. Upload this WARC to Internet Archive.
  • Produce some readable text or HTML output that we could import into the Web Monitoring database and display at monitoring.envirodatagov.org (and therefore also include in weekly analyst sheets).
