Accelerate data page mask computation on device #20280

mhaseeb123 · 2025-10-16T00:29:15Z

Description

Closes #19748

This PR implements a GPU Fenwick tree + search algorithm, along with several host-side optimizations, to significantly accelerate data page mask computation in the next-gen parquet reader. See before and after performance results here. Credits to @vuule for the Fenwick tree solution for this problem (Thanks again!)

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2025-10-16T00:29:19Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

mhaseeb123 · 2025-10-17T06:28:44Z

Performance Improvement

The Fenwick-tree-based computation of the data page mask, along with the other optimizations in this PR, significantly reduces the time to compute the data page mask in the second step of table materialization in the next-gen reader.
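For readers unfamiliar with the idea, here is a minimal host-side sketch in Python. It is not the PR's CUDA implementation: it uses a classic binary indexed (Fenwick) tree over the boolean row mask and marks a page as required when any row in its `[start, end)` range is set. The names `FenwickTree` and `page_mask` are hypothetical and chosen only for illustration.

```python
class FenwickTree:
    """Classic 1-indexed binary indexed tree over integer counts."""

    def __init__(self, values):
        self.n = len(values)
        self.tree = [0] * (self.n + 1)
        # O(n) build: add each value, then push the node's partial sum to its parent.
        for i, v in enumerate(values, start=1):
            self.tree[i] += int(v)
            parent = i + (i & -i)
            if parent <= self.n:
                self.tree[parent] += self.tree[i]

    def prefix_sum(self, end):
        """Sum of values[0:end]."""
        total = 0
        while end > 0:
            total += self.tree[end]
            end -= end & -end  # drop the lowest set bit
        return total

    def range_sum(self, start, end):
        """Sum of values[start:end]."""
        return self.prefix_sum(end) - self.prefix_sum(start)


def page_mask(row_mask, page_row_offsets):
    """A page is required if at least one row in its [start, end) range is True."""
    tree = FenwickTree(row_mask)
    return [
        tree.range_sum(start, end) > 0
        for start, end in zip(page_row_offsets, page_row_offsets[1:])
    ]


# Three pages of 100 rows each; only row 100 survives the filter.
row_mask = [False] * 100 + [True] + [False] * 199
mask = page_mask(row_mask, [0, 100, 200, 300])
```

In this sketch each page query costs O(log n) regardless of how many rows the page spans, which is the property that makes a tree-based range query attractive when there are many pages across many columns.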

Setup

Generate a parquet file using the following python script


import cudf
import cupy
df = cudf.DataFrame()
nrows = 100_000_000
df['key'] = cupy.random.randint(0, 1000000, nrows)
df['key'] = df['key'].astype(str)
x = 1000
y = nrows // x + 1
for i in range(10):
    df[f'payload_{i}'] = cupy.tile(cupy.random.rand(x), y)[0:nrows]
lookup_value = 4500000000
df.at[99, 'key'] = str(lookup_value)
# make sure to write page index with `statistics='COLUMN'`
df.to_parquet('/path/to/generated/file.parquet', statistics='COLUMN')

Run/profile the hybrid_scan_io example on the generated parquet file as:

hybrid_scan_io /path/to/generated/file.parquet key 4500000000

Before this PR (with thread pool)

compute_data_page_mask is the bottleneck in materialize_payload_columns, pushing its total time above 80 ms.

With this PR

compute_data_page_mask completes in under 5 ms, and materialize_payload_columns takes 19 ms in total.

mhaseeb123 · 2025-10-17T06:37:56Z

cpp/src/io/parquet/experimental/page_index_filter.cu


 namespace {

-/**


All these utilities moved to page_index_filter_utils.hpp/.cu files

mhaseeb123 · 2025-10-17T06:39:06Z

cpp/src/io/parquet/experimental/page_index_filter_utils.cu

@@ -0,0 +1,262 @@
+


All these utilities moved as is from page_index_filter.cu

mhaseeb123 · 2025-10-17T06:39:24Z

cpp/src/io/parquet/experimental/page_index_filter_utils.cu

+  return page_indices;
+}
+
+std::vector<size_type> compute_fenwick_tree_level_offsets(cudf::size_type level0_size)


This is a new utility function though

mhaseeb123 · 2025-10-17T06:39:43Z

cpp/src/io/parquet/experimental/page_index_filter_utils.hpp

@@ -0,0 +1,101 @@
+/*


All this moved as is from page_index_filter.cu

mhaseeb123 · 2025-10-17T06:40:32Z

cpp/src/io/parquet/experimental/page_index_filter.cu

-  auto const num_columns = input_columns.size();
+  auto const total_rows = total_rows_in_row_groups(row_group_indices);
+
+  // Return an empty vector if all rows are invalid or all rows are required


Moved this check earlier on.

mhaseeb123 · 2025-10-17T06:43:59Z

cpp/src/io/parquet/experimental/page_index_filter.cu

-    return all_required_data_pages(page_row_counts);
+  // Make sure all row_mask elements contain valid values even if they are nulls
+  if constexpr (cuda::std::is_same_v<ColumnView, cudf::mutable_column_view>) {
+    if (row_mask.nullable()) {


Nulls are to be treated as true in the Fenwick tree, so for filter columns we set row_mask to true at those positions.
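A minimal sketch of that substitution, in plain Python rather than the PR's device code. The helper name `treat_nulls_as_true` and the separate `validity` list are illustrative assumptions, not cudf APIs.

```python
def treat_nulls_as_true(values, validity):
    """Return a row mask in which every null position is replaced by True.

    values   -- the boolean payload of the row mask (arbitrary where null)
    validity -- True where the row mask element is valid, False where it is null
    """
    return [value if valid else True for value, valid in zip(values, validity)]


# The third element is null, so it is kept as a potentially surviving row.
filled = treat_nulls_as_true([True, False, False], [True, True, False])
```

After this step every element holds a well-defined boolean, so a sum over any row range counts null rows as set.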

mhaseeb123 · 2025-10-17T06:44:38Z

cpp/src/io/parquet/experimental/page_index_filter.cu

+                       });
+    }
+  } else {
+    CUDF_EXPECTS(not row_mask.nullable() or row_mask.null_count() == 0,


Row mask is guaranteed to not have nulls in case of payload columns

mhaseeb123 · 2025-10-17T06:46:15Z

cpp/src/io/parquet/experimental/page_index_filter.cu

+  auto data_page_mask    = cudf::detail::make_empty_host_vector<bool>(total_pages, stream);
+  auto host_results_iter = host_results.begin();
+  stream.synchronize();
+  // Discard results for invalid ranges. i.e. ranges starting at the last page of a column and


There is one invalid range between the pages of consecutive columns, which we need to discard. For example, if each column has 3 pages of 100 rows each, the flattened page offsets look like [0, 100, 200, 300, 0, 100, 200, 300, ...], so the range [300, 0) is invalid.
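The example above can be sketched on the host as follows. This is an illustration in Python, not the code in the PR; `flatten_page_ranges` is a hypothetical helper, and it takes the per-column offsets as a list of lists so that column boundaries can be identified by position instead of by comparing offset values.

```python
def flatten_page_ranges(offsets_per_column):
    """Flatten per-column page row offsets and pair up consecutive entries.

    Returns the flattened offsets and the list of valid [start, end) ranges,
    skipping the pair that straddles two columns.
    """
    flat = [offset for column in offsets_per_column for offset in column]

    # Position of the last offset of every column in the flattened list.
    boundaries = set()
    position = 0
    for column in offsets_per_column:
        position += len(column)
        boundaries.add(position - 1)

    ranges = [
        (flat[i], flat[i + 1])
        for i in range(len(flat) - 1)
        if i not in boundaries  # a range starting at a column's last offset is invalid
    ]
    return flat, ranges


# Two columns, each with 3 pages of 100 rows.
flat, ranges = flatten_page_ranges([[0, 100, 200, 300], [0, 100, 200, 300]])
```

With two columns the flattened list yields seven consecutive pairs, of which six are real pages; the discarded one is `(300, 0)`.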


mhaseeb123 · 2025-10-17T17:38:32Z

cpp/examples/hybrid_scan_io/common_utils.cpp

+        };
+      write_parquet(lhs_table, "lhs_table.parquet", stream);
+      write_parquet(rhs_table, "rhs_table.parquet", stream);
+      throw std::logic_error("Tables identical: false\n\n");


We want to see an exception if the tables aren't equal to alert the user

mhaseeb123 · 2025-10-20T23:50:25Z

cpp/src/io/parquet/experimental/page_index_filter.cu

+  __device__ bool inline constexpr is_power_of_two(cudf::size_type value) const noexcept
+  {
+    return (value & (value - 1)) == 0;
+  }
+
+  /**
+   * @brief Finds the smallest power of two in the range [start, end); 0 otherwise
+   *
+   * @param start Range start
+   * @param end Range end
+   * @return Smallest power of two in the range [start, end); 0 otherwise
+   */
+  __device__ cudf::size_type inline constexpr smallest_power_of_two_in_range(
+    cudf::size_type start, cudf::size_type end) const noexcept
+  {
+    start--;
+    start |= start >> 1;
+    start |= start >> 2;
+    start |= start >> 4;
+    start |= start >> 8;
+    start |= start >> 16;
+    auto const result = start + 1;
+    return result < end ? result : 0;
+  }
+
+  /**
+   * @brief Finds the largest power of two in the range (start, end]; 0 otherwise
+   *
+   * @param start Range start
+   * @param end Range end
+   * @return Largest power of two in the range (start, end]; 0 otherwise
+   */
+  __device__ size_type inline constexpr largest_power_of_two_in_range(size_type start,
+                                                                      size_type end) const noexcept
+  {
+    auto constexpr nbits = cudf::detail::size_in_bits<size_type>() - 1;
+    auto const result    = size_type{1} << (nbits - cuda::std::countl_zero<uint32_t>(end));
+    return result > start ? result : 0;
+  }


I feel like these should be moved somewhere in include/cudf/details/utilities/... (maybe cuda.cuh or integer_utils.cuh) in the future if required
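To make the intended behavior of the three helpers easier to check, here is a host-side Python sketch. It swaps the bit-smearing sequence and `cuda::std::countl_zero` from the diff for Python's `int.bit_length()`, so it reproduces the results for positive 32-bit inputs rather than the exact instructions. The stated preconditions are assumptions of this sketch.

```python
def is_power_of_two(value):
    # Same expression as in the diff; note that it also returns True for 0.
    return (value & (value - 1)) == 0


def smallest_power_of_two_in_range(start, end):
    """Smallest power of two in [start, end); 0 if there is none.

    Assumes start >= 1. Rounding start up to a power of two is what the
    decrement, shift-or cascade and increment in the diff compute.
    """
    result = 1 << (start - 1).bit_length()
    return result if result < end else 0


def largest_power_of_two_in_range(start, end):
    """Largest power of two in (start, end]; 0 if there is none.

    Assumes end >= 1. Keeping only the highest set bit of end is equivalent
    to 1 << (31 - countl_zero(end)) for a 32-bit value.
    """
    result = 1 << (end.bit_length() - 1)
    return result if result > start else 0
```

For example, `smallest_power_of_two_in_range(5, 9)` yields 8 while `smallest_power_of_two_in_range(5, 8)` yields 0 because the upper bound is exclusive, and `largest_power_of_two_in_range(8, 10)` yields 0 because the lower bound is exclusive.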

mhaseeb123 · 2025-10-21T01:53:08Z

Hi @res-life, would you like to try the new reader (even just via the example) with this PR included in your build? See the performance difference with this PR here: #20280 (comment)

mhaseeb123 added 5 commits October 15, 2025 02:25

Flatten the data page mask

df8100d

Templatize ColumnView type for compute data page mask

bc6f08b

Minor refactor

30dbbee

Initial impl of GPU data page mask algorithm

73f64f2

Minor improvements

de65183

github-actions bot assigned mhaseeb123 Oct 16, 2025

github-actions bot added the libcudf label Oct 16, 2025

mhaseeb123 and others added 2 commits October 15, 2025 17:29

Merge branch 'branch-25.12' into fea/gpu-compute-data-page-mask

2519f5f

Add cmake changes

f6b79c7

github-actions bot added the CMake label Oct 16, 2025

Revert to old probe impl

aef9d64

mhaseeb123 added the 2 - In Progress, improvement, non-breaking, and cuIO labels Oct 16, 2025

mhaseeb123 and others added 10 commits October 16, 2025 01:43

style fix

ebf460a

Use functors

c62f6b7

Optimize for the chunked reader

f9e8f85

Minor optimizations

9ea83f5

Small optimizations

38a640b

CPU optimizations

0071dc7

Bug fixing in fenwick tree search

06dfca4

Bug fixing in fenwick tree

8ac9c3a

Minor optimizations

e5e12ec

Documentation and style

3d056ed

Merge branch 'branch-25.12' into fea/gpu-compute-data-page-mask

60c789f

mhaseeb123 commented Oct 17, 2025


mhaseeb123 removed the 2 - In Progress label Oct 17, 2025

mhaseeb123 added the 3 - Ready for Review label Oct 17, 2025

mhaseeb123 marked this pull request as ready for review October 17, 2025 06:56

mhaseeb123 requested review from a team as code owners October 17, 2025 06:56

mhaseeb123 requested review from PointKernel, karthikeyann and vuule October 17, 2025 06:56

mhaseeb123 commented Oct 17, 2025

cpp/src/io/parquet/experimental/page_index_filter_utils.cu (outdated)

mhaseeb123 added 2 commits October 17, 2025 07:14

Small optimization

55948f0

Minor optimizations

a2221a5

ttnghia reviewed Oct 17, 2025

cpp/src/io/parquet/experimental/page_index_filter.cu (outdated)

ttnghia reviewed Oct 17, 2025

cpp/src/io/parquet/experimental/page_index_filter.cu (outdated)

ttnghia reviewed Oct 17, 2025

cpp/src/io/parquet/experimental/page_index_filter.cu (outdated)

mhaseeb123 added 2 commits October 17, 2025 17:13

Style check

fc6c534

Use host worker pool instead of c++ threads

488b1de

mhaseeb123 requested a review from ttnghia October 17, 2025 17:37

mhaseeb123 commented Oct 17, 2025

cpp/src/io/parquet/experimental/page_index_filter.cu (outdated)

mhaseeb123 added 2 commits October 17, 2025 10:41

Apply suggestion from @mhaseeb123

33489df

style pls

ffacb93

GregoryKimball added this to libcudf Oct 17, 2025

GregoryKimball moved this to Burndown in libcudf Oct 17, 2025

Merge branch 'main' into fea/gpu-compute-data-page-mask

c820f50

mhaseeb123 commented Oct 20, 2025


mhaseeb123 added the 4 - Needs Review label and removed the 3 - Ready for Review label Oct 21, 2025
