-
Notifications
You must be signed in to change notification settings - Fork 1.6k
speedup date_trunc
(~7x faster) in some cases
#16859
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 2 commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -28,7 +28,7 @@ use arrow::array::types::{ | |
ArrowTimestampType, TimestampMicrosecondType, TimestampMillisecondType, | ||
TimestampNanosecondType, TimestampSecondType, | ||
}; | ||
use arrow::array::{Array, PrimitiveArray}; | ||
use arrow::array::{Array, ArrayRef, Int64Array, PrimitiveArray}; | ||
use arrow::datatypes::DataType::{self, Null, Timestamp, Utf8, Utf8View}; | ||
use arrow::datatypes::TimeUnit::{self, Microsecond, Millisecond, Nanosecond, Second}; | ||
use datafusion_common::cast::as_primitive_array; | ||
|
@@ -60,6 +60,8 @@ use chrono::{ | |
- hour / HOUR | ||
- minute / MINUTE | ||
- second / SECOND | ||
- millisecond / MILLISECOND | ||
- microsecond / MICROSECOND | ||
"# | ||
), | ||
argument( | ||
|
@@ -185,6 +187,21 @@ impl ScalarUDFImpl for DateTruncFunc { | |
) -> Result<ColumnarValue> { | ||
let parsed_tz = parse_tz(tz_opt)?; | ||
let array = as_primitive_array::<T>(array)?; | ||
|
||
// fast path for fine granularities | ||
if matches!( | ||
granularity.as_str(), | ||
"second" | "minute" | "millisecond" | "microsecond" | ||
) || (parsed_tz.is_none() && granularity.as_str() == "hour") | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. let's unlock the optimization for "day" in zoneless case. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. make sense 👍 346ae59 |
||
{ | ||
let result = general_date_trunc_array_fine_granularity( | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Does this assume zone offsets are multiples of a whole hour? (for example, Asia/Kathmandu is +05:45) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The Added a test case f50fb3f for this case (TIL There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Because (AFAIK) no timezone has a different, non-integer offset on minute like There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You're right. For minute and smaller granularities the new code path is always fine¹
![]()
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. SELECT date_trunc('minute', arrow_cast('1910-01-01T00:00:00Z', 'Timestamp(Second, Some("Asia/Kathmandu"))'));
SELECT date_trunc('minute', arrow_cast(v, 'Timestamp(Second, Some("Asia/Kathmandu"))')) FROM (VALUES ('1910-01-01T00:00:00Z')) t(v);
this doesn't look correct to me There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It doesn't look right to me either. First, it's not truncating to the minute, and second I have my doubts that TZ is correct with that time. I'd have to think about that a bit more but I think a ticket should be filed to have that looked at some more There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
it looks it does let timestamp = chrono::DateTime::parse_from_rfc3339("1910-01-01T00:00:00Z").unwrap();
dbg!(×tamp);
let tz = chrono_tz::Asia::Kathmandu;
// let tz = arrow::array::timezone::Tz::from_str("Asia/Kathmandu").unwrap();
dbg!(&tz);
let zoned_date_time = timestamp.with_timezone(&tz);
dbg!(&zoned_date_time);
let local_date_time = zoned_date_time.naive_local();
dbg!(&local_date_time);
let truncated_local = local_date_time.with_second(0).unwrap();
dbg!(&truncated_local);
let mapped_back = match truncated_local.and_local_timezone(tz) {
MappedLocalTime::Single(mapped) => mapped,
_ => panic!(),
};
dbg!(&mapped_back);
let timestamp = mapped_back.to_utc();
dbg!(×tamp);
dbg!(timestamp.timestamp());
however either us, or arrow, makes assumption that offsets are whole minutes. this is visible in the output above (https://github.com/apache/datafusion/pull/16859/files#r2229281161). maybe that's rendering in the CLI output rendering layer. it makes reasoning about this harder.
while per-chrono above, the correct result seems to be -1893456016_000000000 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. |
||
T::UNIT, | ||
array, | ||
granularity.as_str(), | ||
)?; | ||
return Ok(ColumnarValue::Array(result)); | ||
} | ||
|
||
let array: PrimitiveArray<T> = array | ||
.try_unary(|x| { | ||
general_date_trunc(T::UNIT, x, parsed_tz, granularity.as_str()) | ||
|
@@ -423,6 +440,51 @@ fn date_trunc_coarse(granularity: &str, value: i64, tz: Option<Tz>) -> Result<i6 | |
Ok(value.unwrap()) | ||
} | ||
|
||
/// Fast path for fine granularities (hour and smaller) that can be handled | ||
/// with simple arithmetic operations without calendar complexity. | ||
/// | ||
/// This function is timezone-agnostic and should only be used when: | ||
/// - No timezone is specified in the input, OR | ||
/// - The granularity is less than hour as hour can be affected by DST transitions in some cases | ||
fn general_date_trunc_array_fine_granularity<T: ArrowTimestampType>( | ||
tu: TimeUnit, | ||
array: &PrimitiveArray<T>, | ||
granularity: &str, | ||
) -> Result<ArrayRef> { | ||
let unit = match (tu, granularity) { | ||
(Second, "minute") => Some(Int64Array::new_scalar(60)), | ||
(Second, "hour") => Some(Int64Array::new_scalar(3600)), | ||
|
||
(Millisecond, "second") => Some(Int64Array::new_scalar(1_000)), | ||
(Millisecond, "minute") => Some(Int64Array::new_scalar(60_000)), | ||
(Millisecond, "hour") => Some(Int64Array::new_scalar(3_600_000)), | ||
|
||
(Microsecond, "millisecond") => Some(Int64Array::new_scalar(1_000)), | ||
(Microsecond, "second") => Some(Int64Array::new_scalar(1_000_000)), | ||
(Microsecond, "minute") => Some(Int64Array::new_scalar(60_000_000)), | ||
(Microsecond, "hour") => Some(Int64Array::new_scalar(3_600_000_000)), | ||
|
||
(Nanosecond, "microsecond") => Some(Int64Array::new_scalar(1_000)), | ||
(Nanosecond, "millisecond") => Some(Int64Array::new_scalar(1_000_000)), | ||
(Nanosecond, "second") => Some(Int64Array::new_scalar(1_000_000_000)), | ||
(Nanosecond, "minute") => Some(Int64Array::new_scalar(60_000_000_000)), | ||
(Nanosecond, "hour") => Some(Int64Array::new_scalar(3_600_000_000_000)), | ||
_ => None, | ||
}; | ||
|
||
if let Some(unit) = unit { | ||
let original_type = array.data_type(); | ||
let array = arrow::compute::cast(array, &DataType::Int64)?; | ||
let array = arrow::compute::kernels::numeric::div(&array, &unit)?; | ||
let array = arrow::compute::kernels::numeric::mul(&array, &unit)?; | ||
let array = arrow::compute::cast(&array, original_type)?; | ||
Ok(array) | ||
} else { | ||
// truncate to the same or smaller unit | ||
Ok(Arc::new(array.clone())) | ||
} | ||
} | ||
|
||
// truncates a single value with the given timeunit to the specified granularity | ||
fn general_date_trunc( | ||
tu: TimeUnit, | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minutes
is correct for modern zones, but not historical dates.The old code apparently has some issues with them too though (https://github.com/apache/datafusion/pull/16859/files#r2229547803)
Let's add a code comment about our conscious ignorance of historical zone offsets here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add some comments in 346ae59