Skip to content

refactor character_length impl by unifying null handling logic #16877

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: main
Choose a base branch
from
Open
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
41 changes: 8 additions & 33 deletions datafusion/functions/src/unicode/character_length.rs
Original file line number Diff line number Diff line change
Expand Up @@ -136,56 +136,31 @@ where
// string is ASCII only is relatively cheap.
// If strings are ASCII only, count bytes instead.
let is_array_ascii_only = array.is_ascii();
let array = if array.null_count() == 0 {
let nulls = array.nulls().cloned();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the idea to remove the no-nulls optimization because it doesn't make things faster?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think avoiding the null check (array.is_null(i)) makes things faster and it seems we don't need 0 as we always cloned array.nulls().cloned()

let array = {
if is_array_ascii_only {
let values: Vec<_> = (0..array.len())
.map(|i| {
let value = array.value(i);
// Safety: we are iterating with array.len() so the index is always valid
let value = unsafe { array.value_unchecked(i) };
T::Native::usize_as(value.len())
})
.collect();
PrimitiveArray::<T>::new(values.into(), None)
PrimitiveArray::<T>::new(values.into(), nulls)
} else {
let values: Vec<_> = (0..array.len())
.map(|i| {
let value = array.value(i);
// Safety: we are iterating with array.len() so the index is always valid
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another idea here is if you know the values are always ascii, you can avoid making a &str at all -- and instead simply compute the character lengths based on the offsets array (for StringArray and LargeStringArray) or the views for `StringViewArray)

let value = unsafe { array.value_unchecked(i) };
if value.is_ascii() {
T::Native::usize_as(value.len())
} else {
T::Native::usize_as(value.chars().count())
}
})
.collect();
PrimitiveArray::<T>::new(values.into(), None)
PrimitiveArray::<T>::new(values.into(), nulls)
}
} else if is_array_ascii_only {
let values: Vec<_> = (0..array.len())
.map(|i| {
if array.is_null(i) {
T::default_value()
} else {
let value = array.value(i);
T::Native::usize_as(value.len())
}
})
.collect();
PrimitiveArray::<T>::new(values.into(), array.nulls().cloned())
} else {
let values: Vec<_> = (0..array.len())
.map(|i| {
if array.is_null(i) {
T::default_value()
} else {
let value = array.value(i);
if value.is_ascii() {
T::Native::usize_as(value.len())
} else {
T::Native::usize_as(value.chars().count())
}
}
})
.collect();
PrimitiveArray::<T>::new(values.into(), array.nulls().cloned())
};

Ok(Arc::new(array))
Expand Down