[GH-48691][C++] Write serializer could crash if the value buffer is empty #48692
@wgtmac Hi, could you please help review this? Thanks!
    // Set all bits to 0 (null)
    ::arrow::bit_util::SetBitsTo(null_bitmap->mutable_data(), 0, 100, false);

    std::shared_ptr<::arrow::Buffer> data_buffer = nullptr;
Please correct me if I'm wrong. I think the Arrow spec is vague on whether the value buffer can be null if all values are null. A null value buffer also escapes the Array::Validate check, as in
arrow/cpp/src/arrow/array/validate.cc
Lines 505 to 507 in abbcd53
    if (buffer == nullptr) {
      continue;
    }
If this violates the spec, would it be better to fix Array::Validate() and call it before calling functor.Serialize()?
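To make the concern concrete, here is a minimal standalone sketch of a validation loop that skips null buffer pointers, as the quoted snippet does. It is a model only: `ToyBuffer`, `ValidateBuffers`, and the per-buffer minimum-size check are hypothetical and are not taken from Arrow's validate.cc.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for a buffer; not an Arrow type.
using ToyBuffer = std::vector<std::uint8_t>;

// Returns true if every non-null buffer holds at least the required bytes.
// Assumption: a simple size check stands in for whatever the real validator
// does after the null-pointer test.
bool ValidateBuffers(const std::vector<std::shared_ptr<ToyBuffer>>& buffers,
                     const std::vector<std::size_t>& min_sizes) {
  assert(buffers.size() == min_sizes.size());
  for (std::size_t i = 0; i < buffers.size(); ++i) {
    if (buffers[i] == nullptr) {
      continue;  // mirrors the quoted snippet: null pointers are not checked
    }
    if (buffers[i]->size() < min_sizes[i]) {
      return false;  // a present but undersized buffer is rejected
    }
  }
  return true;
}
```

In this model a null value buffer passes validation while an empty but non-null one does not, so calling the validator before serializing would not by itself catch the null-pointer case.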
"Arrow spec is vague on whether the value buffer can be null"
Yes, that confuses me too. I think we don't support a null value buffer but do accept an empty value buffer if the batch is all nulls? It seems we're avoiding null value buffers: https://github.com/apache/arrow/pull/2243/changes
IIRC it's deliberate that a null buffer pointer is accepted there. I would rather not have this but it could break compatibility with existing usage.
In any case, feel free to open a separate issue about it.
    SerializeFunctor<ParquetType, ArrowType> functor;
    RETURN_NOT_OK(functor.Serialize(checked_cast<const ArrayType&>(array), ctx, buffer));
    // The value buffer could be empty if all values are nulls.
    if (array.null_count() != array.length()) {
For an invalid Arrow array, the value buffer can still be nullptr when array.null_count() == array.length(), which crashes the following call.
But we don't care about invalid Arrow arrays here, do we?
    ASSERT_OK_AND_ASSIGN(null_bitmap, ::arrow::AllocateBitmap(100));
    // Set all bits to 0 (null)
    ::arrow::bit_util::SetBitsTo(null_bitmap->mutable_data(), 0, 100, false);
You can just use AllocateEmptyBitmap which will zero-initialize the bitmap.
Rationale for this change
WriteArrowSerialize could unconditionally read values from the Arrow array, even for null rows. Because a caller may provide a zero-sized dummy value buffer for an all-null array, this caused an ASAN heap-buffer-overflow.
What changes are included in this PR?
Check up front that the array is not all nulls before serializing it.
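The guard can be illustrated with a small standalone model. This is a sketch under stated assumptions, not the PR's code: `ToyArray`, `NullCount`, and `SerializeToy` are hypothetical names, and plain `std::vector` stands in for Arrow buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a primitive Arrow array; not an Arrow type.
struct ToyArray {
  std::size_t length = 0;
  std::vector<bool> is_valid;        // one entry per row; false means null
  std::vector<std::int32_t> values;  // may be empty when every row is null
};

// Counts the null rows of the array.
std::size_t NullCount(const ToyArray& array) {
  assert(array.is_valid.size() == array.length);
  std::size_t nulls = 0;
  for (std::size_t i = 0; i < array.length; ++i) {
    if (!array.is_valid[i]) ++nulls;
  }
  return nulls;
}

// Widens every slot into `out`. Returns false instead of reading out of
// bounds when the value storage is too short for a not-all-null array.
bool SerializeToy(const ToyArray& array, std::vector<std::int64_t>* out) {
  out->assign(array.length, 0);
  // The guard: an all-null array may carry an empty value buffer,
  // so its values are never touched.
  if (NullCount(array) == array.length) {
    return true;
  }
  if (array.values.size() < array.length) {
    return false;  // invalid input in this model
  }
  // Like the serializer described above, this reads every slot,
  // including the slots of null rows.
  for (std::size_t i = 0; i < array.length; ++i) {
    (*out)[i] = array.values[i];
  }
  return true;
}
```

Without the all-null check, the final loop would index into an empty `values` vector, which is the model's analogue of the reported heap-buffer-overflow.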
Are these changes tested?
Added tests.
Are there any user-facing changes?
No