-
Notifications
You must be signed in to change notification settings - Fork 520
feat!: kernel log replay #3601
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat!: kernel log replay #3601
Conversation
1905331
to
9cf764c
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## next #3601 +/- ##
===========================================
- Coverage 75.44% 19.14% -56.31%
===========================================
Files 146 74 -72
Lines 44905 11907 -32998
Branches 44905 11907 -32998
===========================================
- Hits 33878 2279 -31599
- Misses 9210 9430 +220
+ Partials 1817 198 -1619 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
@roeap third time is a charm ^^ |
/// Trait for types that stream [RecordBatch] | ||
/// | ||
/// See [`SendableRecordBatchStream`] for more details. | ||
pub trait RecordBatchStream: Stream<Item = DeltaResult<RecordBatch>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess same remark as last time, why aren't you using existing traits for this that sit in Datafusion or arrow-rs? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think these are only in datafusion, however we'll always need them also when there is no datafusion dependency.
Tried looking in arrow, but did not find a stream one ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does exist in arrow: https://docs.rs/arrow/latest/arrow/array/trait.RecordBatchReader.html but it's an iterator, not sure how stream is much different?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but thats an iterator, not a stream though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true yeah
I'm fine with merging it once the test pass! But it will prevent us releasing the python bindings until those missing features are added |
@ion-elgreco - just making sure you mean the The other changes are likely permanent :). That said, I might give upstreaming |
a4bb309
to
edc0e6c
Compare
Yes the stats parsed mostly. The CDC files indeed was never used, and I understood from Stephen, that kernel has better ways of handling this DataChange I am not sure tbh what the impact would be, didn't other writers rely on this? |
The only real use case I see for this field is for conflict resolution. In that case though we cannot use the same methods as we do for the replay since we need more actions (txn's, ...) and also the data change field. So we would use the raw action stream from kernels log segment for getting actions in conflict resolution, where ew need the property ... |
@ion-elgreco - I believe @rtyler investigated a bit and it may be that the failing S3 tests are maybe due to our test setup and not necessarily a bug in our code. One idea from @rtyler (that I like :)) was to update our APIs to just take URLs (and leave the string parsing to urls up to the caller). Will experiment a bit how much of an impact that would be. That said, I feel we somehow need to at some point just get it over with and then focus on improving test coverage for a while as we build out some features. |
Signed-off-by: Robert Pack <robstar.pack@gmail.com>
Signed-off-by: Robert Pack <robstar.pack@gmail.com>
I am landing this to a new topic branch |
Description
This is the third instalment of trying to move our log replay to leverage delta-kernel. Hoping that this time the pill is just small enough to swallow :).
This introduces a regression in handling checkpoints where only
stats_parsed
is written and thestats
field is omitted. We can still read these tables, but there will be no file stats-based file skipping during a scan. This needs to be handled upstream.There are a few more behaviour changes:
cdc_files
method, as this one was not used anywhere and did not properly handle replay for cdc files.dataChange
ortags
field in add action data, so it is no longer part of theRecordBatch
returned foradd_action_table
.There is a lot more room for improvement but the aim for this PR was to try and keep the diff as "small" as possible (not claiming that this is a small diff though 😆). My hope is that we can land this PR and then quickly follow up with some more PRs that do some more focussed refactors.
Of course there are not only the aforementioned regressions, but also new features that are enabled via this PR. In theory v2 checkpoints should just work now, which is something to follow up with quickly and properly test ...
Also, one of the most awaited features - deletion vectors (#1094) - is now in reach, and something I want to address ASAP!
cc @zachschuermann