Migrate report data to S3 bucket #30

@jpvelez

What

In the report analysis notebooks, change all functions that read data from local paths to read from S3 instead. Only the raw input data is in scope; local cache files can stay local:

  1. Read from the S3 bucket s3://data.sb in us-west-2
  2. Time how long these calls take when run from a laptop
  3. Time how long they take from a devcontainer in the us-west-2 AWS region
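
One way the read-and-time steps above might look in Python is sketched below. It assumes `pyarrow` is installed and that AWS credentials are available in the environment. `time_call` and `read_parquet_from_s3` are hypothetical helper names, not functions from this repository, and the object key layout inside the bucket is not specified in this issue.

```python
import time


def time_call(fn, *args, repeats=3, **kwargs):
    """Call fn `repeats` times; return (last_result, list_of_wall_clock_seconds)."""
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return result, durations


def read_parquet_from_s3(path, region="us-west-2"):
    """Read one parquet file from S3 into an Arrow table.

    `path` is "bucket/key" with no "s3://" prefix, because the
    filesystem object is passed explicitly.
    """
    # Imported lazily so time_call stays usable without pyarrow installed.
    import pyarrow.parquet as pq
    from pyarrow import fs

    s3 = fs.S3FileSystem(region=region)
    return pq.read_table(path, filesystem=s3)
```

For example, `time_call(read_parquet_from_s3, "data.sb/<key>")` with a real object key would return the table and one duration per read. Running the same call from a laptop and from the devcontainer gives directly comparable numbers.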

Why

To finally migrate the repository over to reading data from our data lake rather than local filesystems.

How

  • Time how long each report takes to render locally
  • Move the following datasets to the bucket:
    • ny_aeba_grid: nyiso_hourly_load.parquet
    • ct_hp_rates: bsf metadata and monthly loads (need to download first)
    • ri_hp_rates: bsf metadata and monthly loads
    • il_lea: il_lea_data.csv (though this should probably be read from Google Sheets)
    • il_npa: ??
    • (and any others I missed that the reports need to run)
  • Update the data read calls to use arrow reading from s3 paths
  • Time how long each report takes to render when reading from S3
  • Time a couple of individual read functions, especially ones reading large parquet files
  • Repeat the report / function timings on devcontainer
  • The idea is ultimately to learn whether the read times are tolerable when working locally and when working in a devcontainer in AWS
    • We ultimately want to work in cloud devcontainers anyway, so maybe it's intolerably slow locally but fast enough on AWS, so then we prioritize getting on AWS
    • Or maybe even on AWS it's too close, in which case we can create another issue to create a lightweight caching solution: if you've already downloaded the data before, rerunning the read function should load from disk, not from network.
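
If the follow-up caching issue turns out to be needed, the "load from disk, not from network" behavior could be sketched roughly as follows. This is only an illustration under assumptions: `cached_path` and `fetch_from_s3` are hypothetical names, the cache directory is chosen by the caller, and `fetch_from_s3` assumes `pyarrow` is installed.

```python
import shutil
from pathlib import Path


def cached_path(key, fetch, cache_dir):
    """Return a local path for `key`, calling fetch(key, dest) only on a cache miss."""
    dest = Path(cache_dir) / key
    if dest.exists():
        return dest  # cache hit: no network access
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_name(dest.name + ".part")
    fetch(key, tmp)    # download under a temporary name first
    tmp.replace(dest)  # rename so an interrupted download never looks cached
    return dest


def fetch_from_s3(key, dest, region="us-west-2"):
    """Copy one S3 object ("bucket/key") to a local file."""
    # Imported lazily so cached_path can be tested without pyarrow.
    from pyarrow import fs

    s3 = fs.S3FileSystem(region=region)
    with s3.open_input_stream(key) as src, open(dest, "wb") as out:
        shutil.copyfileobj(src, out)
```

A read function would then call `cached_path(key, fetch_from_s3, cache_dir)` and open the returned local path. This sketch does not check whether the object in S3 has changed since it was cached.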

Deliverables

  • Data in S3
  • PR with functioning read-from-s3 calls
  • Comments in this thread documenting the file and function read times
