@ArjunJagdale
Contributor

This PR adds a prototype module for detecting data leaks and duplicates between dataset splits, as suggested in issue #2994 and inspired by the LLE blog post.

🔍 What it does:

  • Introduces detect_leaks(dataset_name, subset=None) in libs/libcommon/src/libcommon/leak_detection.py
  • Uses DuckDB to efficiently compute:
    • Leaks from train → test
    • Leaks from validation → test
    • Duplicates within train/validation/test splits
  • Supports datasets with tokens, text, and optionally ner_tags
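The checks listed above can be sketched roughly as follows. This is a minimal illustration, not the code in `leak_detection.py`: it uses Python's built-in `sqlite3` in place of DuckDB so it runs without extra dependencies (the SQL is plain enough that it should carry over to DuckDB largely unchanged), and the helper name `count_leaks_and_dups` is hypothetical. It also assumes that a "leak" is a distinct example present in both splits, that a "duplicate" is any row beyond the first occurrence within a split, and that token lists are compared after joining them with spaces.

```python
import sqlite3


def count_leaks_and_dups(splits):
    """Count cross-split leaks and within-split duplicates.

    `splits` maps a split name to a list of examples, where each example
    is either a string or a list of tokens (assumed input shape).
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE examples (split TEXT, content TEXT)")
    for name, rows in splits.items():
        con.executemany(
            "INSERT INTO examples VALUES (?, ?)",
            # Token lists are flattened to one string so they can be compared.
            [(name, r if isinstance(r, str) else " ".join(r)) for r in rows],
        )

    def leaks(src, dst):
        # INTERSECT keeps the distinct examples present in both splits.
        sql = """
            SELECT COUNT(*) FROM (
                SELECT content FROM examples WHERE split = ?
                INTERSECT
                SELECT content FROM examples WHERE split = ?
            )
        """
        return con.execute(sql, (src, dst)).fetchone()[0]

    stats = {}
    if "test" in splits:
        for src in ("train", "validation"):
            if src in splits:
                stats[f"{src}_test_leaks"] = leaks(src, "test")

    for name in splits:
        total, distinct = con.execute(
            "SELECT COUNT(*), COUNT(DISTINCT content) FROM examples WHERE split = ?",
            (name,),
        ).fetchone()
        # Rows beyond the first occurrence of each example count as duplicates.
        stats[f"{name}_dup"] = total - distinct

    con.close()
    return stats
```

Calling it with a dict such as `{"train": [...], "validation": [...], "test": [...]}` returns a dict keyed like the stats shown in the example output (`train_test_leaks`, `train_dup`, and so on). The actual counts for a real dataset depend on how leaks and duplicates are defined, so they may differ from what the prototype reports.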

🛠️ Example usage:

python libs/libcommon/src/libcommon/leak_detection.py

Running it on conll2003 prints stats such as:

train_test_leaks: 78
validation_test_leaks: 25
train_dup: 1350
validation_dup: 180
test_dup: 269

📌 Notes:

  • This is a standalone prototype, not yet wired into backend jobs or UI components.

  • No tests are included yet, since this is an early, isolated utility.

  • Future steps may include:

    • Integrating into dataset QA dashboards
    • Running this check on newly uploaded datasets
    • Displaying results in the Hub

Closes #2994

…ce#2994)

This PR adds a prototype module for detecting data leaks and duplicates between dataset splits, as suggested in issue huggingface#2994 and inspired by the [LLE blog post](https://huggingface.co/blog/lbourdois/lle).

### 🔍 What it does:
- Introduces `detect_leaks(dataset_name, subset=None)` in `libs/libcommon/src/libcommon/leak_detection.py`
- Uses DuckDB to efficiently compute:
  - Leaks from train → test
  - Leaks from validation → test
  - Duplicates within train/validation/test splits
- Supports datasets with `tokens`, `text`, and optionally `ner_tags`

### 🛠️ Example usage:
```bash
python libs/libcommon/src/libcommon/leak_detection.py
```
Will output stats for conll2003 like:

```yaml
train_test_leaks: 78
validation_test_leaks: 25
train_dup: 1350
validation_dup: 180
test_dup: 269
```

### 📌 Notes:
This is a standalone prototype and not yet wired into jobs or UI.

Future steps may include integrating this into dataset QA dashboards or automated Hub checks.

Closes huggingface#2994
Development

Successfully merging this pull request may close these issues.

Compute leaks between splits?
