What
In the report analysis notebooks, change all functions that read data from local paths to read from S3 instead (it's okay to keep cache files local; the scope is the raw input data):
- Read from the S3 bucket `s3://data.sbinus-west-2`
- Time how long these calls take to run from a laptop (see the timing sketch below)
- Time how long they take from a devcontainer in the us-west-2 AWS region
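A minimal timing sketch of what these measurements could look like, assuming the notebooks can use pyarrow; the object key under the bucket is a hypothetical placeholder:

```python
# Minimal sketch: time a single parquet read from the bucket.
# Assumes pyarrow; the key layout under the bucket is a guess.
import time
import pyarrow.parquet as pq

S3_PATH = "s3://data.sbinus-west-2/ny_aeba_grid/nyiso_hourly_load.parquet"  # hypothetical key

start = time.perf_counter()
table = pq.read_table(S3_PATH)  # pyarrow resolves s3:// URIs via pyarrow.fs.S3FileSystem
elapsed = time.perf_counter() - start
print(f"read {table.num_rows:,} rows in {elapsed:.1f} s")
```

Running the same snippet from a laptop and from a us-west-2 devcontainer gives the two timings we want to compare.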
Why
To finally migrate the repository over to reading data from our data lake rather than local filesystems.
How
- Time how long each report takes to render locally
- Move the following datasets to the bucket (see the upload sketch after this list):
  - `ny_aeba_grid`: `nyiso_hourly_load.parquet`
  - `ct_hp_rates`: bsf metadata and monthly loads (need to download first)
  - `ri_hp_rates`: bsf metadata and monthly loads
  - `il_lea`: `il_lea_data.csv` (though this should probably be read from Google Sheets)
  - `il_npa`: ??
  - (and any others that I missed that the reports need to run)
- Update the data read calls to use arrow to read from S3 paths
- Time how long each report takes to render when reading from S3
- Time a couple of individual read functions, especially ones reading large parquet files
- Repeat the report and function timings in a devcontainer
- The idea is ultimately to learn whether the read times are tolerable when working locally and when working in a devcontainer in AWS
- We ultimately want to work in cloud devcontainers anyway, so maybe it's intolerably slow locally but fast enough on AWS, in which case we prioritize getting onto AWS
- Or maybe it's too slow even on AWS, in which case we can open another issue for a lightweight caching solution: if you've already downloaded the data before, rerunning the read function should load from disk, not over the network (see the caching sketch after this list)
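For the dataset move, a rough sketch of one way to push the files up with boto3; the local paths and key prefixes are placeholders, and the bucket name is taken as written above:

```python
# Rough sketch: upload the local dataset files to the bucket with boto3.
# The local paths and key prefixes below are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")
uploads = {
    "data/nyiso_hourly_load.parquet": "ny_aeba_grid/nyiso_hourly_load.parquet",
    "data/il_lea_data.csv": "il_lea/il_lea_data.csv",
}
for local_path, key in uploads.items():
    s3.upload_file(local_path, "data.sbinus-west-2", key)
    print(f"uploaded {local_path} -> s3://data.sbinus-west-2/{key}")
```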
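And a rough sketch of the lightweight caching idea, in case even the AWS read times are too slow; the function name, cache directory, and region are assumptions:

```python
# Sketch of a disk cache around an S3 parquet read: the first call pulls the
# bytes from S3, later calls reuse the local copy. Names here are hypothetical.
from pathlib import Path

import pyarrow.fs
import pyarrow.parquet as pq

def read_parquet_cached(s3_path: str, cache_dir: str = ".s3_cache"):
    """Read a parquet file from S3, caching the raw bytes on local disk."""
    bucket_and_key = s3_path.removeprefix("s3://")
    local_path = Path(cache_dir) / bucket_and_key
    if not local_path.exists():
        local_path.parent.mkdir(parents=True, exist_ok=True)
        s3 = pyarrow.fs.S3FileSystem(region="us-west-2")  # assumed region
        with s3.open_input_stream(bucket_and_key) as src, open(local_path, "wb") as dst:
            dst.write(src.read())
    return pq.read_table(local_path)
```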
Deliverables
- Data in S3
- PR with functioning read-from-S3 calls
- Comments in this thread documenting the file and function read times