Skip to content

Conversation

mhaseeb123
Copy link
Member

@mhaseeb123 mhaseeb123 commented Oct 16, 2025

Description

Closes #19748

This PR implements a GPU Fenwick tree + search algorithm along with several host side optimizations to significantly accelerate data page computation mask in the next-gen parquet reader. See before and after performance results here. Credits to @vuule for the Fenwick tree solution for this problem (Thanks again!)

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@copy-pr-bot
Copy link

copy-pr-bot bot commented Oct 16, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Oct 16, 2025
@github-actions github-actions bot added the CMake CMake build issue label Oct 16, 2025
@mhaseeb123 mhaseeb123 added 2 - In Progress Currently a work in progress improvement Improvement / enhancement to an existing function non-breaking Non-breaking change cuIO cuIO issue labels Oct 16, 2025
@mhaseeb123
Copy link
Member Author

mhaseeb123 commented Oct 17, 2025

Performance Improvement

The Fenwick tree based computation of data page mask along with other optimizations in this PR significantly improves the time to compute data page mask in the second step of table materialization in the next-gen reader.

Setup

  • Generate a parquet file using the following python script
Click to expand
import cudf
import cupy
df = cudf.DataFrame()
nrows = 100_000_000
df['key'] = cupy.random.randint(0, 1000000, nrows)
df['key'] = df['key'].astype(str)
x = 1000
y = nrows // x + 1
for i in range(10):
    df[f'payload_{i}'] = cupy.tile(cupy.random.rand(x), y)[0:nrows]
lookup_value = 4500000000
df.at[99, 'key'] = str(lookup_value)
# make sure to write page index with `statistics='COLUMN'`
df.to_parquet('/path/to/generated/file.parquet', statistics='COLUMN')
  • Run/profile the hybrid_scan_io example on the generated parquet file as:
hybrid_scan_io /path/to/generated/file.parquet key 4500000000

Before this PR (with thread pool)

The compute_data_page_mask bottlenecks the materialize_payload_columns pushing its total time to 80ms+

image (1)

With this PR

The compute_data_page_mask completes in < 5ms with total time for materialize_payload_columns being 19ms

Screenshot 2025-10-16 at 11 24 37 PM


namespace {

/**
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these utilities moved to page_index_filter_utils.hpp/.cu files

@@ -0,0 +1,262 @@

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these utilities moved as is from page_index_filter.cu

return page_indices;
}

std::vector<size_type> compute_fenwick_tree_level_offsets(cudf::size_type level0_size)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a new utility function though

@@ -0,0 +1,101 @@
/*
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All this moved as is from page_index_filter.cu

auto const num_columns = input_columns.size();
auto const total_rows = total_rows_in_row_groups(row_group_indices);

// Return an empty vector if all rows are invalid or all rows are required
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved this check earlier on.

return all_required_data_pages(page_row_counts);
// Make sure all row_mask elements contain valid values even if they are nulls
if constexpr (cuda::std::is_same_v<ColumnView, cudf::mutable_column_view>) {
if (row_mask.nullable()) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nulls are to be treated as true in the Fenwick tree so set row_mask as true at those positions in case of filter columns

});
}
} else {
CUDF_EXPECTS(not row_mask.nullable() or row_mask.null_count() == 0,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Row mask is guaranteed to not have nulls in case of payload columns

auto data_page_mask = cudf::detail::make_empty_host_vector<bool>(total_pages, stream);
auto host_results_iter = host_results.begin();
stream.synchronize();
// Discard results for invalid ranges. i.e. ranges starting at the last page of a column and
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will have one invalid range between pages of consecutive columns which we need to discard. For example, consider each column has 3 pages each (each page containing 100 rows) then our flattened page offsets will look like this: [0, 100, 200, 300, 0, 100, 200, 300, ...] and so the range given by [300, 0) will be invalid.

@mhaseeb123 mhaseeb123 removed the 2 - In Progress Currently a work in progress label Oct 17, 2025
@mhaseeb123 mhaseeb123 added the 3 - Ready for Review Ready for review by team label Oct 17, 2025
@mhaseeb123 mhaseeb123 marked this pull request as ready for review October 17, 2025 06:56
@mhaseeb123 mhaseeb123 requested review from a team as code owners October 17, 2025 06:56
@mhaseeb123 mhaseeb123 requested a review from ttnghia October 17, 2025 17:37
};
write_parquet(lhs_table, "lhs_table.parquet", stream);
write_parquet(rhs_table, "rhs_table.parquet", stream);
throw std::logic_error("Tables identical: false\n\n");
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to see an exception if the tables aren't equal to alert the user

@GregoryKimball GregoryKimball moved this to Burndown in libcudf Oct 17, 2025
Comment on lines +461 to +499
__device__ bool inline constexpr is_power_of_two(cudf::size_type value) const noexcept
{
return (value & (value - 1)) == 0;
}

/**
* @brief Finds the smallest power of two in the range [start, end); 0 otherwise
*
* @param start Range start
* @param end Range end
* @return Largest power of two in the range [start, end); 0 otherwise
*/
__device__ cudf::size_type inline constexpr smallest_power_of_two_in_range(
cudf::size_type start, cudf::size_type end) const noexcept
{
start--;
start |= start >> 1;
start |= start >> 2;
start |= start >> 4;
start |= start >> 8;
start |= start >> 16;
auto const result = start + 1;
return result < end ? result : 0;
}

/**
* @brief Finds the largest power of two in the range (start, end]; 0 otherwise
*
* @param start Range start
* @param end Range end
* @return Largest power of two in the range (start, end]; 0 otherwise
*/
__device__ size_type inline constexpr largest_power_of_two_in_range(size_type start,
size_type end) const noexcept
{
auto constexpr nbits = cudf::detail::size_in_bits<size_type>() - 1;
auto const result = size_type{1} << (nbits - cuda::std::countl_zero<uint32_t>(end));
return result > start ? result : 0;
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like these should be moved somewhere in include/cudf/details/utilities/... (maybe cuda.cuh or integer_utils.cuh) in the future if required

@mhaseeb123
Copy link
Member Author

Hi @res-life, would you like to play with the new reader (via the example even) with this PR included in your build. See the performance difference with this PR here: #20280 (comment)

@mhaseeb123 mhaseeb123 added 4 - Needs Review Waiting for reviewer to review or respond and removed 3 - Ready for Review Ready for review by team labels Oct 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

4 - Needs Review Waiting for reviewer to review or respond CMake CMake build issue cuIO cuIO issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change

Projects

Status: Burndown

Development

Successfully merging this pull request may close these issues.

[FEA] Accelerate compute_data_page_mask using a custom kernel

2 participants