Skip to content

Conversation

roeap
Copy link
Collaborator

@roeap roeap commented Jul 18, 2025

Description

This is the third instalment of trying to move our log replay to leverage delta-kernel. Hoping that this time the pill is just small enough to swallow :).

This introduces a regression in handling checkpoints where only stats_parsed is written and the stats field is omitted. We can still read these tables, but there will be no file stats-based file skipping during a scan. This needs to be handled upstream.

There are a few more behaviour changes:

  • removed cdc_files method, as this one was not used anywhere and did not properly handle replay for cdc files.
  • delta-kernel does currently not expose the dataChange or tags field in add action data, so it is no longer part of the RecordBatch returned for add_action_table.

There is a lot more room for improvement but the aim for this PR was to try and keep the diff as "small" as possible (not claiming that this is a small diff though 😆). My hope is that we can land this PR and then quickly follow up with some more PRs that do some more focussed refactors.

Of course there are not only the aforementioned regressions, but also new features that are enabled via this PR. In theory v2 checkpoints should just work now, which is something to follow up with quickly and properly test ...

Also, one of the most awaited features - deletion vectors (#1094) - is now in reach, and something I want to address ASAP!

cc @zachschuermann

@github-actions github-actions bot added binding/python Issues for the Python package binding/rust Issues for the Rust crate labels Jul 18, 2025
@roeap roeap force-pushed the feat/log-replay branch 3 times, most recently from 1905331 to 9cf764c Compare July 18, 2025 14:24
Copy link

codecov bot commented Jul 18, 2025

Codecov Report

❌ Patch coverage is 33.22091% with 396 lines in your changes missing coverage. Please review.
✅ Project coverage is 19.14%. Comparing base (5ac08eb) to head (22fbe4e).
⚠️ Report is 12 commits behind head on next.

Files with missing lines Patch % Lines
crates/core/src/kernel/snapshot/serde.rs 0.00% 145 Missing ⚠️
crates/core/src/kernel/snapshot/mod.rs 46.54% 77 Missing and 8 partials ⚠️
crates/core/src/table/state.rs 3.12% 61 Missing and 1 partial ⚠️
crates/core/src/kernel/arrow/engine_ext.rs 57.84% 32 Missing and 11 partials ⚠️
crates/core/src/kernel/scalars.rs 0.00% 18 Missing ⚠️
crates/core/src/kernel/snapshot/replay.rs 0.00% 14 Missing ⚠️
crates/core/src/kernel/snapshot/stream.rs 78.78% 14 Missing ⚠️
crates/core/src/kernel/snapshot/log_data.rs 36.36% 2 Missing and 5 partials ⚠️
crates/core/src/kernel/transaction/mod.rs 0.00% 3 Missing ⚠️
crates/core/src/kernel/models/actions.rs 0.00% 2 Missing ⚠️
... and 3 more

❗ There is a different number of reports uploaded between BASE (5ac08eb) and HEAD (22fbe4e). Click for more details.

HEAD has 4 uploads less than BASE
Flag BASE (5ac08eb) HEAD (22fbe4e)
6 2
Additional details and impacted files
@@             Coverage Diff             @@
##             next    #3601       +/-   ##
===========================================
- Coverage   75.44%   19.14%   -56.31%     
===========================================
  Files         146       74       -72     
  Lines       44905    11907    -32998     
  Branches    44905    11907    -32998     
===========================================
- Hits        33878     2279    -31599     
- Misses       9210     9430      +220     
+ Partials     1817      198     -1619     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@ion-elgreco
Copy link
Collaborator

@roeap third time is a charm ^^

/// Trait for types that stream [RecordBatch]
///
/// See [`SendableRecordBatchStream`] for more details.
pub trait RecordBatchStream: Stream<Item = DeltaResult<RecordBatch>> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess same remark as last time, why aren't you using existing traits for this that sit in Datafusion or arrow-rs? :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think these are only in datafusion, however we'll always need them also when there is no datafusion dependency.

Tried looking in arrow, but did not find a stream one ...

Copy link
Collaborator

@ion-elgreco ion-elgreco Jul 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does exist in arrow: https://docs.rs/arrow/latest/arrow/array/trait.RecordBatchReader.html but it's an iterator, not sure how stream is much different?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but thats an iterator, not a stream though?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true yeah

@roeap roeap force-pushed the feat/log-replay branch from d01c138 to b13aa52 Compare July 19, 2025 10:07
@ion-elgreco
Copy link
Collaborator

I'm fine with merging it once the test pass! But it will prevent us releasing the python bindings until those missing features are added

@roeap
Copy link
Collaborator Author

roeap commented Jul 19, 2025

@ion-elgreco - just making sure you mean the stats_parsed one?

The other changes are likely permanent :). dataChange is a property only required in conflict resolution (I think), and the cdc files will (likely) just be replaced by a something scan_metadata like, just for CDF ...

That said, I might give upstreaming stats_parsed a try to move things along.

@roeap roeap force-pushed the feat/log-replay branch 2 times, most recently from a4bb309 to edc0e6c Compare July 19, 2025 20:03
@ion-elgreco
Copy link
Collaborator

ion-elgreco commented Jul 24, 2025

@ion-elgreco - just making sure you mean the stats_parsed one?

The other changes are likely permanent :). dataChange is a property only required in conflict resolution (I think), and the cdc files will (likely) just be replaced by a something scan_metadata like, just for CDF ...

That said, I might give upstreaming stats_parsed a try to move things along.

Yes the stats parsed mostly. The CDC files indeed was never used, and I understood from Stephen, that kernel has better ways of handling this

DataChange I am not sure tbh what the impact would be, didn't other writers rely on this?

@roeap
Copy link
Collaborator Author

roeap commented Jul 26, 2025

DataChange I am not sure tbh what the impact would be, didn't other writers rely on this?

The only real use case I see for this field is for conflict resolution. In that case though we cannot use the same methods as we do for the replay since we need more actions (txn's, ...) and also the data change field. So we would use the raw action stream from kernels log segment for getting actions in conflict resolution, where ew need the property ...

@roeap roeap force-pushed the feat/log-replay branch from a9b087d to fc9ac1b Compare July 26, 2025 10:12
@ion-elgreco
Copy link
Collaborator

Now that the other stuff has been merged and released, @roeap @rtyler should we get those final tests passing before we merge or we just merge it as-is?

@roeap
Copy link
Collaborator Author

roeap commented Aug 1, 2025

@ion-elgreco - I believe @rtyler investigated a bit and it may be that the failing S3 tests are maybe due to our test setup and not necessarily a bug in our code.

One idea from @rtyler (that I like :)) was to update our APIs to just take URLs (and leave the string parsing to urls up to the caller). Will experiment a bit how much of an impact that would be.

That said, I feel we somehow need to at some point just get it over with and then focus on improving test coverage for a while as we build out some features.

Signed-off-by: Robert Pack <robstar.pack@gmail.com>
@roeap roeap force-pushed the feat/log-replay branch from b7d3424 to f33ca2c Compare August 1, 2025 17:25
Signed-off-by: Robert Pack <robstar.pack@gmail.com>
@roeap roeap force-pushed the feat/log-replay branch from f33ca2c to 22fbe4e Compare August 1, 2025 17:36
@roeap roeap mentioned this pull request Aug 2, 2025
@rtyler rtyler changed the base branch from main to next August 2, 2025 18:13
@rtyler
Copy link
Member

rtyler commented Aug 2, 2025

I am landing this to a new topic branch next for us to collaborate in

@rtyler rtyler closed this Aug 2, 2025
@roeap roeap deleted the feat/log-replay branch August 16, 2025 22:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/python Issues for the Python package binding/rust Issues for the Rust crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants