Skip to content

Conversation

@lliangyu-lin
Copy link
Contributor

Description

  • Add delete files snapshot producer API for marking manifest entry as deleted based on a predicate
  • Being used as part of [DRAFT] feat(table): Support table delete with copy on write #518
  • This only helps deletes on manifest level when the entire partition or data file matched the predicate. For partial matched data file, a rewrite is still needed in the table delete.

Testing

Pending to add unit tests

err := df.computeDeletes(iceberg.EqualTo(iceberg.Reference("foo"), true), true)
t.Require().NoError(err)

updates, reqs, err := updater.commit()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we need an idiomatic way to override the commit in snapshotUpdate. When no entries were deleted, updates and requirements should be empty instead of creating a snapshot with no changes.
https://github.com/apache/iceberg-python/blob/main/pyiceberg/table/update/snapshot.py#L370-L373

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense. Perhaps we need another base function in the impls to allow the producer to either override the commit method or otherwise indicate there was no change and thus no need to create a new snapshot

Copy link
Member

@zeroshade zeroshade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also add a DeleteByPredicate method?

snapshotProps iceberg.Properties,
) *snapshotProducer {
prod := createSnapshotProducer(op, txn, fs, commitUUID, snapshotProps)
prod.producerImpl = &deleteFiles{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we default predicate to AlwaysFalse{}?

predicate iceberg.BooleanExpression
caseSensitive bool

computed bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does computed represent?

Comment on lines 448 to 452
partitionFilter, err := project(df.predicate)
if err != nil {
return nil, err
}
return partitionFilter, nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just do return project(df.predicate)

Comment on lines +591 to +600
func (df *deleteFiles) ensureComputed() error {
if !df.computed {
err := df.computeDeletes(iceberg.AlwaysFalse{}, true)
if err != nil {
return err
}
}

return nil
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You've got a race condition here. You need to lock around this or use a sync.Once, or something equivalent.

err := df.computeDeletes(iceberg.EqualTo(iceberg.Reference("foo"), true), true)
t.Require().NoError(err)

updates, reqs, err := updater.commit()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense. Perhaps we need another base function in the impls to allow the producer to either override the commit method or otherwise indicate there was no change and thus no need to create a new snapshot

@lliangyu-lin
Copy link
Contributor Author

@zeroshade Thank you for the feedbacks!
There are actually quiet few things I'm still fixing locally and addressed few of your comments as well. I'll tag you for another round of review once I address all the issues mentioned.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants