feat(libcommon): add duckdb-based leak detection prototype (#2994) #3202
This PR adds a prototype module for detecting data leaks and duplicates between dataset splits, as suggested in issue #2994 and inspired by the LLE blog post.
🔍 What it does:
- Adds `detect_leaks(dataset_name, subset=None)` in `libs/libcommon/src/libcommon/leak_detection.py`
- Checks `tokens`, `text`, and optionally `ner_tags`

🛠️ Example usage:

Will output stats for `conll2003` like:

📌 Notes:
- This is a standalone prototype, not yet wired into backend jobs or UI components.
- No local test coverage is added at this point, as it's an early isolated utility.
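For reviewers who want a feel for the cross-split comparison, here is a minimal sketch of the idea. It is not the code in `leak_detection.py`: the function name `count_shared_samples`, the `samples` table, and the toy strings are all hypothetical, and the standard-library `sqlite3` module stands in for DuckDB so the snippet runs without extra dependencies. The query is plain SQL and is expected to carry over to a DuckDB connection with little or no change.

```python
import sqlite3


def count_shared_samples(splits):
    """Count distinct values that appear in more than one split.

    `splits` maps a split name to a list of strings (for example a text
    column, or tokens joined with spaces). Hypothetical helper, used only
    to illustrate the technique.
    """
    con = sqlite3.connect(":memory:")  # stand-in for a DuckDB connection
    con.execute("CREATE TABLE samples (split_name TEXT, value TEXT)")
    con.executemany(
        "INSERT INTO samples VALUES (?, ?)",
        [(name, value) for name, values in splits.items() for value in values],
    )
    # Deduplicate within each split first, then self-join on the value.
    # The `<` condition keeps each unordered pair of splits exactly once.
    query = """
        SELECT a.split_name, b.split_name, COUNT(*) AS shared
        FROM (SELECT DISTINCT split_name, value FROM samples) AS a
        JOIN (SELECT DISTINCT split_name, value FROM samples) AS b
          ON a.value = b.value AND a.split_name < b.split_name
        GROUP BY a.split_name, b.split_name
        ORDER BY a.split_name, b.split_name
    """
    rows = con.execute(query).fetchall()
    con.close()
    return {(a, b): shared for a, b, shared in rows}


# Toy data, made up for this sketch (not taken from any real dataset).
toy_splits = {
    "train": ["alpha beta", "gamma delta", "gamma delta"],
    "validation": ["gamma delta", "epsilon"],
    "test": ["alpha beta", "gamma delta"],
}
overlap = count_shared_samples(toy_splits)
```

Deduplicating inside each split before the join keeps within-split duplicates from inflating the cross-split counts; those could be reported separately.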
Future steps may include:
Closes #2994