Convert ConEd meter data from csv to parquet #43

@jpvelez

Description

What

The ConEd smart meter data is currently sitting in our S3 bucket as CSV files. We should convert them to Parquet.

Why

CSV files are human-readable and simple, but they're inefficient for analytical queries. Parquet is a columnar format that solves this by storing data column-by-column with a defined schema. This matters because: (1) columns of the same type compress far better than row-oriented data, so Parquet files are typically 5-10x smaller than equivalent CSVs, and (2) you can query just the columns you need instead of reading entire rows.

Beyond raw compression, Parquet enables predicate pushdown—the ability to filter data before it leaves disk. Instead of loading gigabytes into memory and filtering in Python, Parquet readers (like PyArrow) can push your WHERE clause down to the storage layer and skip irrelevant data entirely. This works through three mechanisms:

Partition pruning relies on splitting your dataset into separate files or directories based on a column (e.g., one partition per month or per customer), then skipping the partitions a query can't match. If you query "WHERE date > '2024-01-01'", Arrow can skip entire partitions without reading them. This is the coarsest and most powerful optimization for large datasets.
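
For example, here is a minimal sketch of what pruning looks like from Polars, assuming a hypothetical hive-partitioned layout under s3://our-bucket/meter_readings/ with year= and month= directories (the bucket, paths, and column names are placeholders, not our actual data):

```python
import polars as pl

# Hypothetical layout: s3://our-bucket/meter_readings/year=2024/month=06/part-0.parquet
# AWS credentials are assumed to come from the environment (e.g. AWS_PROFILE).
lazy = pl.scan_parquet(
    "s3://our-bucket/meter_readings/**/*.parquet",
    hive_partitioning=True,  # parse year=/month= directories into columns
)

# A filter on partition columns prunes whole directories before any
# Parquet data pages are downloaded.
recent = (
    lazy.filter((pl.col("year") == 2024) & (pl.col("month") >= 6))
    .select("meter_id", "interval_start", "reading_kwh")
    .collect()
)
```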

Row group filtering works within partitions. Parquet stores data in chunks called row groups, each with min/max statistics on every column. If you query "WHERE meter_id = 12345" and a row group's min/max for meter_id doesn't overlap with that value, Arrow skips it without decompressing. This is fast because stats are tiny.
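
You can see the statistics that make this skipping possible by inspecting a file's footer with PyArrow (the file path, column index, and filter value below are placeholders):

```python
import pyarrow.parquet as pq

# Footer metadata only: no data pages are read in this loop.
meta = pq.ParquetFile("meter_readings.parquet").metadata

for rg in range(meta.num_row_groups):
    stats = meta.row_group(rg).column(0).statistics  # column 0: e.g. meter_id
    print(rg, stats.min, stats.max, stats.null_count)  # assumes stats were written

# Readers apply the same stats automatically: row groups whose min/max range
# excludes 12345 are skipped without being decompressed.
table = pq.read_table("meter_readings.parquet", filters=[("meter_id", "==", 12345)])
```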

Dictionary encoding compresses categorical columns (like customer IDs or meter types) by storing a lookup table of unique values instead of repeating strings. This enables fast filtering on those columns and reduces file size dramatically.
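
PyArrow's Parquet writer dictionary-encodes columns by default; on the Polars side, the usual move is to cast low-cardinality string columns to Categorical before writing, as in this sketch (column names and values are made up):

```python
import polars as pl

df = pl.DataFrame(
    {
        "meter_type": ["residential", "commercial", "residential", "residential"],
        "reading_kwh": [1.2, 4.7, 0.9, 1.4],
    }
)

# Categorical stores each unique string once plus small integer codes; the
# Parquet writer keeps that dictionary-plus-codes representation on disk.
df = df.with_columns(pl.col("meter_type").cast(pl.Categorical))
df.write_parquet("readings.parquet")
```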

The result: faster queries, smaller files, and the ability to run analytical code on datasets that might otherwise require a database.

How

  • Learn Parquet optimization concepts
    • Understand partition pruning, row group filtering, and dictionary encoding by talking to Claude. It helps to ask for code examples and diagrams of the Parquet file contents themselves, and it works best if you start by giving it examples of the CSVs.
    • Understand when each of these concepts is useful and what tradeoffs exist (e.g., too many partitions = too many small files; too few = less pruning).
    • Write a short summary in a comment, explaining how each concept works, when it's useful, and what tradeoffs exist, to confirm your understanding.
  • Design schema and column data types
    • Examine the raw CSV files
    • Propose a Parquet schema (in a comment below) with good column names and appropriate data types
    • Document why each choice is made—consider both query patterns and compression efficiency
  • Propose partitioning and sort strategy
    • Write up the query patterns (in a comment below) we expect to run most frequently against this smart meter data (e.g., "filter by date range and customer ID," "aggregate energy by meter type and hour"), providing 3-5 concrete query examples using Polars
    • Propose a partitioning scheme (what column(s), how granular?) and within-partition sort order that minimizes row group filtering and I/O
    • Justify each choice against the expected queries—explain which predicates will be pruned vs. scanned and why this ordering is efficient
  • Write code to implement csv-to-parquet migration
    • Determine whether to use file-level or multi-file consolidation: decide if each CSV maps 1:1 to a Parquet file, or if you need to read multiple CSVs into memory, deduplicate/merge them, then write consolidated Parquet files. This interacts with the partitioning strategy: if CSVs overlap in time or customer ID, you may need to consolidate them before you can partition correctly.
    • Design S3 bucket structure and naming conventions. Propose where to write the Parquet output in the output bucket—consider partitioned directory hierarchies, naming schemes for individual files, and how to separate source CSVs from processed Parquets.
    • Write Polars code for schema inference, casting, and partitioned output. Use Polars to read the CSV(s) directly from S3 (s3:// paths) with an explicit schema and data type casts (Int64, Float32, Categorical, Date, etc.), then write Parquet back to S3, using partition_by() for your chosen partition columns and maintain_order=True plus explicit sort columns for within-partition ordering. A rough sketch of this step appears after this list.
      • Test on a subset first to validate schema and partition structure.
    • Design and implement parallel execution with observability
      • Design a strategy to process multiple CSVs concurrently.
      • Ensure your approach includes structured logging and error handling so you can identify which files succeeded, which failed, and why.
      • Consider checkpointing—can you resume from partially completed work, or do you need to restart the entire pipeline?
  • Run code on AWS to perform migration
    • Set up AWS infrastructure and execution environment
      • Launch VM
        • Consider EC2 instance sizing (memory vs. CPU) based on CSV size and consolidation needs
        • Check in when you get to this point, as we'll likely have an easier solution than deploying a raw EC2 machine
      • Install migration code
      • Ensure you can read from and write to the S3 bucket from the VM
    • Launch migration code in a persistent session
      • Migration code should be invoked by a Just command
      • Migration code should be runnable on any future CSV files that are uploaded to the S3 bucket
    • Monitor and troubleshoot migration until all files have been successfully processed
    • Double-check that output data is complete and correct
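
For the "Write Polars code..." step above, here is a rough sketch of the conversion under some loud assumptions: the bucket paths, column names, dtypes, partition columns, and sort order are all placeholders, to be replaced by whatever comes out of the schema and partitioning comments.

```python
import polars as pl

# Placeholder schema: the real column names/dtypes come from the schema proposal.
CSV_SCHEMA = {
    "meter_id": pl.Int64,
    "customer_id": pl.Int64,
    "meter_type": pl.String,
    "interval_start": pl.String,  # parsed to Datetime below
    "reading_kwh": pl.Float32,
}

SRC = "s3://our-bucket/raw/coned/*.csv"           # placeholder input path
DST = "s3://our-bucket/processed/coned_readings"  # hive-partitioned output


def convert() -> None:
    # AWS credentials are assumed to come from the environment (e.g. an instance role).
    lf = (
        pl.scan_csv(SRC, schema_overrides=CSV_SCHEMA)
        .with_columns(
            pl.col("meter_type").cast(pl.Categorical),
            pl.col("interval_start").str.to_datetime(),
        )
        .with_columns(
            pl.col("interval_start").dt.year().alias("year"),
            pl.col("interval_start").dt.month().alias("month"),
        )
        # Sort before writing so rows land in row groups in this order,
        # giving tight min/max stats on meter_id and interval_start.
        .sort("meter_id", "interval_start")
    )
    # partition_by writes one hive-style directory per (year, month).
    lf.collect().write_parquet(DST, partition_by=["year", "month"])


if __name__ == "__main__":
    convert()
```

If this shape holds up, the entrypoint could be wrapped in a `just migrate` recipe so re-running it on future CSV uploads is a one-liner.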

Deliverables

  • Comment explaining Parquet optimization concepts
  • Comment proposing a Parquet schema, and justifying choices made
  • Comment proposing a partitioning and sort strategy, and justifying choices made
  • Finalized schema and partitioning/sort strategy, based on feedback on these comments
  • A PR with code that reads CSV files from S3 and writes them as Parquet files to S3 with the correct schema, partitioning, and sort order
  • Complete and correct parquet data in s3 bucket
