Accelerate data page mask computation on device #20280
Conversation
namespace {

/**
All these utilities moved to page_index_filter_utils.hpp/.cu files.
@@ -0,0 +1,262 @@
All these utilities moved as is from page_index_filter.cu
  return page_indices;
}

std::vector<size_type> compute_fenwick_tree_level_offsets(cudf::size_type level0_size)
This is a new utility function though
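The PR does not show the body of this new utility, but its name and signature suggest it computes where each level of a flattened Fenwick-style tree begins. A minimal host-side sketch, assuming a level-by-level pairwise-reduction layout (each level holds one node per pair of nodes below, sizes rounding up) — the actual layout in the PR may differ:

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch: start offsets of each level of a flattened tree over
// `level0_size` leaves, with a final entry marking the end of the last level.
std::vector<int> compute_fenwick_tree_level_offsets(int level0_size)
{
  std::vector<int> offsets{0};
  int size = level0_size;
  while (size > 1) {
    offsets.push_back(offsets.back() + size);
    size = (size + 1) / 2;  // next level holds one node per pair (rounded up)
  }
  offsets.push_back(offsets.back() + size);  // end offset of the topmost level
  return offsets;
}
```

For example, 8 leaves produce levels of sizes 8, 4, 2, 1 packed back to back, so the offsets are `{0, 8, 12, 14, 15}`.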
@@ -0,0 +1,101 @@
/*
All this moved as is from page_index_filter.cu
auto const num_columns = input_columns.size();
auto const total_rows  = total_rows_in_row_groups(row_group_indices);

// Return an empty vector if all rows are invalid or all rows are required
Moved this check earlier on.
return all_required_data_pages(page_row_counts);
// Make sure all row_mask elements contain valid values even if they are nulls
if constexpr (cuda::std::is_same_v<ColumnView, cudf::mutable_column_view>) {
  if (row_mask.nullable()) {
Nulls are to be treated as true in the Fenwick tree, so set row_mask to true at those positions in the case of filter columns.
  });
}
} else {
  CUDF_EXPECTS(not row_mask.nullable() or row_mask.null_count() == 0,
Row mask is guaranteed to not have nulls in case of payload columns
auto data_page_mask    = cudf::detail::make_empty_host_vector<bool>(total_pages, stream);
auto host_results_iter = host_results.begin();
stream.synchronize();
// Discard results for invalid ranges, i.e. ranges starting at the last page of a column and
We will have one invalid range between the pages of consecutive columns, which we need to discard. For example, if each column has 3 pages (each page containing 100 rows), the flattened page offsets will look like this: [0, 100, 200, 300, 0, 100, 200, 300, ...], so the range given by [300, 0) will be invalid.
};
write_parquet(lhs_table, "lhs_table.parquet", stream);
write_parquet(rhs_table, "rhs_table.parquet", stream);
throw std::logic_error("Tables identical: false\n\n");
We want an exception to be thrown if the tables aren't equal, to alert the user.
__device__ bool inline constexpr is_power_of_two(cudf::size_type value) const noexcept
{
  return (value & (value - 1)) == 0;
}

/**
 * @brief Finds the smallest power of two in the range [start, end); 0 otherwise
 *
 * @param start Range start
 * @param end Range end
 * @return Smallest power of two in the range [start, end); 0 otherwise
 */
__device__ cudf::size_type inline constexpr smallest_power_of_two_in_range(
  cudf::size_type start, cudf::size_type end) const noexcept
{
  start--;
  start |= start >> 1;
  start |= start >> 2;
  start |= start >> 4;
  start |= start >> 8;
  start |= start >> 16;
  auto const result = start + 1;
  return result < end ? result : 0;
}

/**
 * @brief Finds the largest power of two in the range (start, end]; 0 otherwise
 *
 * @param start Range start
 * @param end Range end
 * @return Largest power of two in the range (start, end]; 0 otherwise
 */
__device__ size_type inline constexpr largest_power_of_two_in_range(size_type start,
                                                                    size_type end) const noexcept
{
  auto constexpr nbits = cudf::detail::size_in_bits<size_type>() - 1;
  auto const result    = size_type{1} << (nbits - cuda::std::countl_zero<uint32_t>(end));
  return result > start ? result : 0;
}
I feel like these should be moved somewhere in include/cudf/details/utilities/...
(maybe cuda.cuh or integer_utils.cuh) in the future if required
Hi @res-life, would you like to play with the new reader (via the example even) with this PR included in your build? See the performance difference with this PR here: #20280 (comment)
Description
Closes #19748
This PR implements a GPU Fenwick tree + search algorithm, along with several host-side optimizations, to significantly accelerate data page mask computation in the next-gen parquet reader. See before and after performance results here. Credits to @vuule for the Fenwick tree solution to this problem (Thanks again!)
Checklist