Create tool to monitor S3 prefixes

Folks at PEDP would like to keep an eye on the status of a bunch of datasets, each of which is a collection of files in an S3 bucket that share a common prefix. For example:

- `s3://nrel-pds-wtk/bchrrr/v1.0.0`
- `s3://nrel-pds-wtk/canada/v1.0.0`

You can’t treat a prefix like an FTP directory and browse to the corresponding URL (e.g. `https://nrel-pds-wtk.s3.amazonaws.com/bchrrr/v1.0.0/`) to get a list of files. (Depending on bucket configuration, you can do this at the *root* of the bucket, but nowhere else.) Instead, we need to develop something more specialized to keep an eye on these files.

The AWS CLI can already do a lot of this for us, so we may just need some scripting around that:

```sh
> aws s3 ls --summarize --recursive s3://nrel-pds-wtk/bchrrr/v1.0.0/
2024-11-19 12:43:09 1779388467382 bchrrr/v1.0.0/bchrrr_conus_2015.h5
2024-11-19 12:55:32 1784947575862 bchrrr/v1.0.0/bchrrr_conus_2016.h5
2024-11-19 12:55:32 1779388467382 bchrrr/v1.0.0/bchrrr_conus_2017.h5
2024-11-19 12:55:32 1779388467382 bchrrr/v1.0.0/bchrrr_conus_2018.h5
2024-11-19 12:55:32 1779388463286 bchrrr/v1.0.0/bchrrr_conus_2019.h5
2024-11-19 18:56:10 1784947571766 bchrrr/v1.0.0/bchrrr_conus_2020.h5
2024-11-19 19:52:51 1779388463286 bchrrr/v1.0.0/bchrrr_conus_2021.h5
2024-11-19 19:53:54 1779388463286 bchrrr/v1.0.0/bchrrr_conus_2022.h5
2024-11-19 19:54:02 1779388463286 bchrrr/v1.0.0/bchrrr_conus_2023.h5

Total Objects: 9
   Total Size: 16025614402918
```

Alternatively, you could use a client library to do this and wrap it in some custom code to produce fancier output (e.g. in Python: [`list_objects_v2`](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/client/list_objects_v2.html) or [`S3.Paginator.ListObjectsV2`](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/paginator/ListObjectsV2.html)).
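As a rough illustration of the client-library route, here is a minimal sketch using the boto3 paginator linked above. The function names (`list_prefix`, `summarize`, `make_anonymous_client`, `main`) and the record layout are made up for this example, and it assumes the bucket allows anonymous listing; if it doesn't, a normally configured client with credentials would be needed instead.

```python
from datetime import timezone


def list_prefix(client, bucket, prefix):
    """Return one record per object under s3://bucket/prefix, sorted by key."""
    paginator = client.get_paginator("list_objects_v2")
    records = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is missing from a page when nothing matches the prefix.
        for obj in page.get("Contents", []):
            records.append({
                "key": obj["Key"],
                "size": obj["Size"],
                "modified": obj["LastModified"].astimezone(timezone.utc).isoformat(),
            })
    return sorted(records, key=lambda record: record["key"])


def summarize(records):
    """Mirror the totals that `aws s3 ls --summarize` prints."""
    return {
        "total_objects": len(records),
        "total_size": sum(record["size"] for record in records),
    }


def make_anonymous_client():
    """Build an unsigned S3 client (assumption: the bucket is publicly listable)."""
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    return boto3.client("s3", config=Config(signature_version=UNSIGNED))


def main():
    # Bucket and prefix taken from the examples in this issue.
    records = list_prefix(make_anonymous_client(), "nrel-pds-wtk", "bchrrr/v1.0.0/")
    for record in records:
        print(record["modified"], record["size"], record["key"])
    print(summarize(records))
```

Calling `main()` should print roughly the same information as the `aws s3 ls` command above. Passing the client in as an argument keeps `list_prefix` easy to test with a stand-in object.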

At its simplest, I think we just need something to:
- Run on a set schedule (e.g. once a day).
- Iterate over each S3 prefix we want to monitor.
- List all the files, sizes, and modified-times with the given prefix.
- Handle situations where the prefix disappears entirely.
- Save the results somewhere so they can be compared, run to run.
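The save-and-compare part of that list could look something like the sketch below. It is only one possible shape: the snapshot format (a JSON object mapping each key to its size and modified time) and the function names are assumptions made for illustration, not an agreed design.

```python
import json
from pathlib import Path


def load_snapshot(path):
    """Load the snapshot from an earlier run, or {} if there is none yet."""
    path = Path(path)
    if not path.exists():
        return {}
    return json.loads(path.read_text())


def save_snapshot(path, snapshot):
    """Write the snapshot with stable ordering so file diffs show only real changes."""
    Path(path).write_text(json.dumps(snapshot, indent=2, sort_keys=True) + "\n")


def compare_snapshots(old, new):
    """Compare two {key: {"size": ..., "modified": ...}} snapshots of one prefix."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "deleted": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
        # The prefix had files last time and has none now.
        "prefix_disappeared": bool(old) and not new,
    }
```

A scheduled run would load the previous snapshot for each prefix, build a new one from the current listing, call `compare_snapshots`, and then overwrite the saved file. An empty listing is treated as "the prefix disappeared" only when the previous run saw files, so a brand-new, still-empty prefix does not raise a false alarm.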

The plainest version of this might just be a GitHub Actions workflow that runs on a schedule and saves the results to a GitHub repo, where someone can see diffs between different runs.

Ideally, this would also send people an e-mail or Slack message if files are *deleted*.
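For the Slack option, one lightweight approach would be a Slack incoming webhook, which accepts a JSON body with a `text` field. The sketch below assumes that approach; the function names, the message wording, and the 20-key display limit are all invented for illustration, and the webhook URL would have to come from a secret in whatever runs the job.

```python
import json
import urllib.request


def build_deletion_message(prefix, deleted_keys, limit=20):
    """Return a Slack webhook payload, or None when nothing was deleted."""
    if not deleted_keys:
        return None
    lines = [f"{len(deleted_keys)} file(s) deleted under {prefix}:"]
    lines += [f"- {key}" for key in deleted_keys[:limit]]
    if len(deleted_keys) > limit:
        # Keep the message short when a whole dataset goes away at once.
        lines.append(f"... and {len(deleted_keys) - limit} more")
    return {"text": "\n".join(lines)}


def post_to_slack(webhook_url, payload):
    """POST the payload to a Slack incoming webhook and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.status
```

Keeping message construction separate from sending means the same text could be reused for an e-mail, and the formatting can be checked without touching the network.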

Fancier stuff that would probably be useful:
- Produce a WARC of some sort so other archiving tools can consume it in a semi-standardized way. Upload this WARC to Internet Archive.
- Produce some readable text or HTML output that we could import into the Web Monitoring database and display at monitoring.envirodatagov.org (and therefore also include in weekly analyst sheets).

Create tool to monitor S3 prefixes #177

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Create tool to monitor S3 prefixes #177

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions