[C++][Parquet][Python] New API to 'zip' or (vertically) 'attach' parquet metadata #40958

@mrbrahman

Describe the enhancement requested

Hi,

One of Parquet's design principles, as stated on the project's GitHub page, is 'Separating metadata and column data':

Separating metadata and column data.

The format is explicitly designed to separate the metadata from the data. This allows splitting columns into multiple files, as well as having a single metadata file reference multiple parquet files.

To store the columns of one table in different files, we need to:

  1. Ensure each file has the same number of row groups
  2. Ensure that corresponding row groups across the files contain the same rows
  3. Grab the metadata from each file, 'zip/attach them vertically', and write out the new metadata file
  4. Supply this metadata when reading the table
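
Steps 1 and 2 can already be partly checked with metadata alone. The following is a minimal sketch, not part of any existing or proposed Arrow API: the helper names `row_group_sizes`, `check_zippable`, and `check_files` are hypothetical, and the check only compares row counts, since metadata cannot prove that two row groups hold the same rows.

```python
def row_group_sizes(meta):
    """Return the number of rows in each row group of a FileMetaData-like object."""
    return [meta.row_group(i).num_rows for i in range(meta.num_row_groups)]


def check_zippable(m1, m2):
    """Raise ValueError unless both files share the same row-group layout.

    Hypothetical helper: it only compares counts, so it cannot detect
    row groups that have equal sizes but different rows.
    """
    sizes1, sizes2 = row_group_sizes(m1), row_group_sizes(m2)
    if len(sizes1) != len(sizes2):
        raise ValueError(
            f"row-group count differs: {len(sizes1)} vs {len(sizes2)}"
        )
    for i, (n1, n2) in enumerate(zip(sizes1, sizes2)):
        if n1 != n2:
            raise ValueError(f"row group {i}: {n1} rows vs {n2} rows")


def check_files(path1, path2):
    """Run the layout check on two Parquet files on disk."""
    # Imported here so the pure-Python helpers above work without pyarrow.
    import pyarrow.parquet as pq

    check_zippable(pq.read_metadata(path1), pq.read_metadata(path2))
```

Calling `check_files('file1.parquet', 'file2.parquet')` before combining metadata would surface a layout mismatch early instead of at read time.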

It looks like the Arrow APIs provide nearly everything needed to achieve this, except for the 'zip/attach them vertically' part of point 3 above.

This ticket requests a new API to 'zip'/'join'/'attach' the metadata from two files.

For example:

import pyarrow.parquet as pq
m1 = pq.read_metadata('file1.parquet')  # say this has columns: col1, col2, col3
m1.set_file_path('file1.parquet')

m2 = pq.read_metadata('file2.parquet')  # say this has columns: col4, col5
m2.set_file_path('file2.parquet')

# requesting this new 'zip' API
m = m1.zip(m2)  # needs to ensure same number of row groups, and same number of rows within each row group

# m will now have metadata for col1, col2, col3, col4, col5 each pointing to appropriate data file

m.write_metadata_file('_metadata')  # existing FileMetaData method for writing a metadata-only file
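
To make the intended semantics of the requested `zip` concrete, here is a toy model in plain Python. It is only an illustration under stated assumptions: `ColumnChunk`, `RowGroup`, and `zip_row_groups` are hypothetical names, not pyarrow types, and real Parquet metadata carries far more per column chunk (offsets, statistics, encodings).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnChunk:
    name: str        # column name, e.g. "col1"
    file_path: str   # data file that holds this chunk


@dataclass(frozen=True)
class RowGroup:
    num_rows: int
    columns: tuple   # tuple of ColumnChunk


def zip_row_groups(left, right):
    """Combine two row-group lists column-wise (toy model of the requested API)."""
    if len(left) != len(right):
        raise ValueError("files have different numbers of row groups")
    zipped = []
    for i, (a, b) in enumerate(zip(left, right)):
        if a.num_rows != b.num_rows:
            raise ValueError(f"row group {i} has mismatched row counts")
        # Each output row group lists the chunks of both inputs; every chunk
        # keeps the file_path it came from.
        zipped.append(RowGroup(a.num_rows, a.columns + b.columns))
    return zipped


def make_file(path, names, sizes):
    """Build the row groups of a model file with the given columns and sizes."""
    chunks = tuple(ColumnChunk(n, path) for n in names)
    return [RowGroup(n, chunks) for n in sizes]


m1 = make_file("file1.parquet", ["col1", "col2", "col3"], [100, 50])
m2 = make_file("file2.parquet", ["col4", "col5"], [100, 50])
m = zip_row_groups(m1, m2)
```

In this model each zipped row group exposes all five columns, with `col1` to `col3` pointing at `file1.parquet` and `col4`, `col5` pointing at `file2.parquet`, which is the outcome the request describes.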

Once this is done, the combined data could be read using:

m = pq.read_metadata('_metadata')
data = pq.ParquetFile('file1.parquet', 'file2.parquet', metadata=m)

# data should now be able to show all columns

Component(s)

C++, Python, Parquet
