Open
Labels: Component: C++, Component: Parquet, Component: Python, Status: stale-warning, Type: enhancement
Description
Describe the enhancement requested
Hi,
One of the design principles of Parquet, as listed on its GitHub page, is 'Separating metadata and column data':
Separating metadata and column data.
The format is explicitly designed to separate the metadata from the data. This allows splitting columns into multiple files, as well as having a single metadata file reference multiple parquet files.
To achieve 'columns in different files', we need to:
1. Ensure each file has the same number of row groups
2. Ensure each corresponding row group in each file contains the same rows
3. Grab the metadata from each file, 'zip/attach them vertically', and write out the new metadata file
4. Feed this metadata in while reading the table
It looks like the Arrow APIs provide nearly everything needed to achieve this, except for the 'zip/attach them vertically' step in point 3 above.
This ticket requests a new API to 'zip'/'join'/'attach' the metadata from two files.
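The row-group preconditions can already be checked with what the metadata exposes today. Below is a minimal sketch, assuming the two arguments are objects like those returned by pq.read_metadata (anything exposing num_row_groups and row_group(i).num_rows). The helper name row_groups_align is hypothetical and not part of pyarrow. Note that matching row counts is only a necessary condition: it cannot confirm that corresponding row groups actually hold the same rows in the same order.

```python
def row_groups_align(meta_a, meta_b):
    """Return True if both metadata objects have the same number of row
    groups and the same row count in each corresponding row group.

    Hypothetical helper: it only reads `num_row_groups` and
    `row_group(i).num_rows`, so it cannot verify row identity or order.
    """
    if meta_a.num_row_groups != meta_b.num_row_groups:
        return False
    return all(
        meta_a.row_group(i).num_rows == meta_b.row_group(i).num_rows
        for i in range(meta_a.num_row_groups)
    )
```

With pyarrow installed, this could be called as row_groups_align(pq.read_metadata('file1.parquet'), pq.read_metadata('file2.parquet')) before attempting any kind of metadata zip.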
For example:
import pyarrow.parquet as pq
m1 = pq.read_metadata('file1.parquet') # say this has columns: col1, col2, col3
m1.set_file_path('file1.parquet')
m2 = pq.read_metadata('file2.parquet') # say this has columns: col4, col5
m2.set_file_path('file2.parquet')
# requesting this new 'zip' API
m = m1.zip(m2) # needs to ensure same number of row groups, and same number of rows within each row group
# m will now have metadata for col1, col2, col3, col4, col5 each pointing to appropriate data file
m.write_metadata_file('_metadata')
Once this is done, the combined data could be read using:
m = pq.read_metadata('_metadata')
data = pq.ParquetFile('file1.parquet', 'file2.parquet', metadata=m)
# data should now be able to show all columns
Component(s)
C++, Python, Parquet