Merged
64 changes: 63 additions & 1 deletion datafusion/functions/src/datetime/date_trunc.rs
@@ -28,7 +28,7 @@ use arrow::array::types::{
ArrowTimestampType, TimestampMicrosecondType, TimestampMillisecondType,
TimestampNanosecondType, TimestampSecondType,
};
- use arrow::array::{Array, PrimitiveArray};
+ use arrow::array::{Array, ArrayRef, Int64Array, PrimitiveArray};
use arrow::datatypes::DataType::{self, Null, Timestamp, Utf8, Utf8View};
use arrow::datatypes::TimeUnit::{self, Microsecond, Millisecond, Nanosecond, Second};
use datafusion_common::cast::as_primitive_array;
@@ -60,6 +60,8 @@ use chrono::{
- hour / HOUR
- minute / MINUTE
- second / SECOND
- millisecond / MILLISECOND
- microsecond / MICROSECOND
"#
),
argument(
@@ -185,6 +187,21 @@ impl ScalarUDFImpl for DateTruncFunc {
) -> Result<ColumnarValue> {
let parsed_tz = parse_tz(tz_opt)?;
let array = as_primitive_array::<T>(array)?;

// fast path for fine granularities
if matches!(
granularity.as_str(),
"second" | "minute" | "millisecond" | "microsecond"
**Member:** Truncating to minutes is correct for modern zones, but not for historical dates. The old code apparently has some issues with them too, though (https://github.com/apache/datafusion/pull/16859/files#r2229547803). Let's add a code comment about our conscious ignorance of historical zone offsets here.

**Member Author:** Added some comments in 346ae59.


) || (parsed_tz.is_none() && granularity.as_str() == "hour")
**Member:** Let's unlock the optimization for "day" in the zoneless case.

**Member Author:** Makes sense 👍 346ae59

{
let result = general_date_trunc_array_fine_granularity(
**Member:** Does this assume zone offsets are multiples of a whole hour? (For example, Asia/Kathmandu is +05:45.)

**Member Author:** The `parsed_tz.is_none()` check ensures this array is not associated with any timezone (hence the default UTC), so it is safe to do it this way.

Added a test case in f50fb3f for this case (TIL

**Member (@findepi, Jul 24, 2025):**

> The `parsed_tz.is_none()` check

It's there for hour, but not for minute. Why have different zone-related conditions for hour and minute?

**Member Author:** Because (AFAIK) no timezone has an offset with a non-whole-minute component, like +04:32:10, so we can truncate all timestamps at the minute level regardless of their timezone.

**Member:** You're right. For minute and smaller granularities the new code path is always fine.¹
For hour it's also fine, for zone-less data.

[image]

¹ For zone-less data (timestamp without time zone), day granularity can also be applied with just divide and mul, just like hour. Or am I blind again?
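The divide-and-multiply idea for zone-less day truncation can be sketched in plain Rust (an illustration, not DataFusion code; `trunc_to_day_secs` is a hypothetical helper, and it uses `div_euclid` so pre-epoch values floor toward the earlier day):

```rust
/// Truncate a zone-less seconds-since-epoch timestamp to the start of its day.
/// div_euclid floors for negative inputs, unlike Rust's `/` which truncates
/// toward zero, so pre-epoch timestamps land on the correct (earlier) boundary.
fn trunc_to_day_secs(ts_seconds: i64) -> i64 {
    const SECS_PER_DAY: i64 = 86_400;
    ts_seconds.div_euclid(SECS_PER_DAY) * SECS_PER_DAY
}

fn main() {
    // 1970-01-02T12:00:00Z (1.5 days) -> 1970-01-02T00:00:00Z
    assert_eq!(trunc_to_day_secs(129_600), 86_400);
    // pre-epoch: 1969-12-31T23:59:59Z -> 1969-12-31T00:00:00Z
    assert_eq!(trunc_to_day_secs(-1), -86_400);
}
```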

**Member:**

```sql
SELECT date_trunc('minute', arrow_cast('1910-01-01T00:00:00Z', 'Timestamp(Second, Some("Asia/Kathmandu"))'));
SELECT date_trunc('minute', arrow_cast(v, 'Timestamp(Second, Some("Asia/Kathmandu"))')) FROM (VALUES ('1910-01-01T00:00:00Z')) t(v);
```

```text
+-----------------------------------------------------------------------------------------------------------------------+
| date_trunc(Utf8("minute"),arrow_cast(Utf8("1910-01-01T00:00:00Z"),Utf8("Timestamp(Second, Some("Asia/Kathmandu"))"))) |
+-----------------------------------------------------------------------------------------------------------------------+
| 1910-01-01T05:41:16+05:41                                                                                             |
+-----------------------------------------------------------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.065 seconds.

+----------------------------------------------------------------------------------------------+
| date_trunc(Utf8("minute"),arrow_cast(t.v,Utf8("Timestamp(Second, Some("Asia/Kathmandu"))"))) |
+----------------------------------------------------------------------------------------------+
| 1910-01-01T05:41:16+05:41                                                                    |
+----------------------------------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.012 seconds.
```

This doesn't look correct to me; however, it is the same result as on main, so it's not a regression. Is it because the existing date_trunc code does similar logic to the new optimized code path?

**Contributor:** It doesn't look right to me either. First, it's not truncating to the minute, and second, I have my doubts that the TZ is correct with that time. I'd have to think about it a bit more, but I think a ticket should be filed to have that looked at some more.

**Member:**

> • Not sure whether chrono respects that

It looks like it does:

```rust
use chrono::{MappedLocalTime, Timelike}; // Timelike provides with_second

let timestamp = chrono::DateTime::parse_from_rfc3339("1910-01-01T00:00:00Z").unwrap();
dbg!(&timestamp);
let tz = chrono_tz::Asia::Kathmandu;
// let tz = arrow::array::timezone::Tz::from_str("Asia/Kathmandu").unwrap();
dbg!(&tz);
let zoned_date_time = timestamp.with_timezone(&tz);
dbg!(&zoned_date_time);
let local_date_time = zoned_date_time.naive_local();
dbg!(&local_date_time);
let truncated_local = local_date_time.with_second(0).unwrap();
dbg!(&truncated_local);
let mapped_back = match truncated_local.and_local_timezone(tz) {
    MappedLocalTime::Single(mapped) => mapped,
    _ => panic!(),
};
dbg!(&mapped_back);
let timestamp = mapped_back.to_utc();
dbg!(&timestamp);
dbg!(timestamp.timestamp());
```

```text
[tests/chrono.rs:7:1] &timestamp = 1910-01-01T00:00:00+00:00
[tests/chrono.rs:10:1] &tz = Asia/Kathmandu
[tests/chrono.rs:12:1] &zoned_date_time = 1910-01-01T05:41:16LMT
[tests/chrono.rs:14:1] &local_date_time = 1910-01-01T05:41:16
[tests/chrono.rs:16:1] &truncated_local = 1910-01-01T05:41:00
[tests/chrono.rs:21:1] &mapped_back = 1910-01-01T05:41:00LMT
[tests/chrono.rs:23:1] &timestamp = 1909-12-31T23:59:44Z
[tests/chrono.rs:24:1] timestamp.timestamp() = -1893456016
```

However, either we or arrow assume that offsets are whole minutes; this is visible in the output above (https://github.com/apache/datafusion/pull/16859/files#r2229281161). Maybe that happens in the CLI output rendering layer. It makes reasoning about this harder.

```sql
SELECT cast(date_trunc('minute', arrow_cast(v, 'Timestamp(Second, Some("Asia/Kathmandu"))')) as bigint) FROM (VALUES ('1910-01-01T00:00:00Z')) t(v);
```

- main: -1893456000_000000000
- PR: -1893456000_000000000

while per chrono above, the correct result seems to be -1893456016_000000000.
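That expected value can be cross-checked without chrono at all, using plain arithmetic; a sketch (`trunc_minute_in_zone` is a hypothetical helper, and the +05:41:16 LMT offset is taken from the `dbg!` output in the comment):

```rust
/// Truncate a UTC seconds timestamp to the minute in a fixed-offset zone:
/// shift into local time, floor to the minute, shift back to UTC.
fn trunc_minute_in_zone(utc_secs: i64, offset_secs: i64) -> i64 {
    let local = utc_secs + offset_secs;
    // rem_euclid floors correctly for pre-epoch (negative) timestamps
    let truncated_local = local - local.rem_euclid(60);
    truncated_local - offset_secs
}

fn main() {
    let utc = -1_893_456_000; // 1910-01-01T00:00:00Z
    let offset = 5 * 3600 + 41 * 60 + 16; // Kathmandu LMT, +05:41:16
    // Matches chrono's 1909-12-31T23:59:44Z
    assert_eq!(trunc_minute_in_zone(utc, offset), -1_893_456_016);
}
```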

**Member Author:** Thank you for your very detailed check and explanation. I found another example with an extra search (that website is great)...

[image]

T::UNIT,
array,
granularity.as_str(),
)?;
return Ok(ColumnarValue::Array(result));
}

let array: PrimitiveArray<T> = array
    .try_unary(|x| {
        general_date_trunc(T::UNIT, x, parsed_tz, granularity.as_str())
@@ -423,6 +440,51 @@ fn date_trunc_coarse(granularity: &str, value: i64, tz: Option<Tz>) -> Result<i64>
Ok(value.unwrap())
}

/// Fast path for fine granularities (hour and smaller) that can be handled
/// with simple arithmetic, without calendar complexity.
///
/// This function is timezone-agnostic and should only be used when:
/// - no timezone is specified in the input, OR
/// - the granularity is finer than an hour, since hour truncation can be
///   affected by DST transitions in some cases
fn general_date_trunc_array_fine_granularity<T: ArrowTimestampType>(
    tu: TimeUnit,
    array: &PrimitiveArray<T>,
    granularity: &str,
) -> Result<ArrayRef> {
    let unit = match (tu, granularity) {
        (Second, "minute") => Some(Int64Array::new_scalar(60)),
        (Second, "hour") => Some(Int64Array::new_scalar(3600)),

        (Millisecond, "second") => Some(Int64Array::new_scalar(1_000)),
        (Millisecond, "minute") => Some(Int64Array::new_scalar(60_000)),
        (Millisecond, "hour") => Some(Int64Array::new_scalar(3_600_000)),

        (Microsecond, "millisecond") => Some(Int64Array::new_scalar(1_000)),
        (Microsecond, "second") => Some(Int64Array::new_scalar(1_000_000)),
        (Microsecond, "minute") => Some(Int64Array::new_scalar(60_000_000)),
        (Microsecond, "hour") => Some(Int64Array::new_scalar(3_600_000_000)),

        (Nanosecond, "microsecond") => Some(Int64Array::new_scalar(1_000)),
        (Nanosecond, "millisecond") => Some(Int64Array::new_scalar(1_000_000)),
        (Nanosecond, "second") => Some(Int64Array::new_scalar(1_000_000_000)),
        (Nanosecond, "minute") => Some(Int64Array::new_scalar(60_000_000_000)),
        (Nanosecond, "hour") => Some(Int64Array::new_scalar(3_600_000_000_000)),
        _ => None,
    };

    if let Some(unit) = unit {
        let original_type = array.data_type();
        let array = arrow::compute::cast(array, &DataType::Int64)?;
        let array = arrow::compute::kernels::numeric::div(&array, &unit)?;
        let array = arrow::compute::kernels::numeric::mul(&array, &unit)?;
        let array = arrow::compute::cast(&array, original_type)?;
        Ok(array)
    } else {
        // truncating to the same or a smaller unit is a no-op
        Ok(Arc::new(array.clone()))
    }
}
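For illustration, a scalar sketch of what the divide-then-multiply fast path computes per element (`trunc_fine` is a hypothetical helper, not the actual arrow kernel; `unit` mirrors the match table, e.g. 60 for `(Second, "minute")`):

```rust
/// Floor a timestamp to a multiple of `unit` via divide-then-multiply.
/// Note: Rust's `/` truncates toward zero, so pre-epoch (negative) values
/// round toward zero rather than flooring — worth keeping in mind when
/// reasoning about historical timestamps.
fn trunc_fine(ts: i64, unit: i64) -> i64 {
    (ts / unit) * unit
}

fn main() {
    // (Second, "minute"): 125 s -> 120 s
    assert_eq!(trunc_fine(125, 60), 120);
    // (Millisecond, "hour"): 2 h 30 min -> 2 h
    assert_eq!(trunc_fine(9_000_000, 3_600_000), 7_200_000);
}
```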

// truncates a single value with the given timeunit to the specified granularity
fn general_date_trunc(
tu: TimeUnit,
2 changes: 2 additions & 0 deletions docs/source/user-guide/sql/scalar_functions.md
@@ -2150,6 +2150,8 @@ date_trunc(precision, expression)
- hour / HOUR
- minute / MINUTE
- second / SECOND
- millisecond / MILLISECOND
- microsecond / MICROSECOND

- **expression**: Time expression to operate on. Can be a constant, column, or function.
