|
2 | 2 |
|
3 | 3 | Template for converting BIDS neuroimaging datasets (e.g., ARC, SOOP) into Hugging Face Datasets with NIfTI + tabular features. |
4 | 4 |
|
| 5 | +> ⚠️ **CRITICAL DEPENDENCY WARNING**: This template requires installing `datasets` from GitHub (not PyPI). See [Critical Dependency Requirement](#critical-dependency-requirement-huggingface-datasets) below. |
| 6 | +
|
5 | 7 | ## Overview |
6 | 8 |
|
7 | 9 | This repository provides a **reusable template** for: |
@@ -43,6 +45,62 @@ uv run hf-bids-nifti --help |
43 | 45 |
|
44 | 46 | > **Note:** ARC and SOOP commands are templates that will raise `NotImplementedError` until their file-table builders are implemented. |
45 | 47 |
|
| 48 | +## Critical Dependency Requirement: HuggingFace Datasets |
| 49 | + |
| 50 | +### The Problem |
| 51 | + |
| 52 | +Stable PyPI releases of `datasets` (3.x and 4.x, up to and including 4.4.1) have a **critical bug**: `Nifti.embed_storage` is broken, so NIfTI files are uploaded to the Hugging Face Hub as **empty (0-byte) files**.
| 53 | + |
| 54 | +- **Silent failure**: `push_to_hub(embed_external_files=True)` uploads 0-byte files |
| 55 | +- **No error raised**: Your dataset appears to have data but all NIfTI files are empty |
| 56 | +- **Only visible when loading**: `load_dataset()` returns empty/corrupted images |
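Because the failure is silent, it can help to sanity-check the raw bytes of a file before trusting an upload. The following is a minimal sketch, not part of this template or of `datasets`: the helper name `looks_like_nifti` is hypothetical, and how you obtain the bytes (from a local file, a downloaded Parquet shard, and so on) is left out. It rejects empty payloads and checks the standard NIfTI-1/NIfTI-2 header fields (`sizeof_hdr` of 348 or 540, plus the magic string), decompressing `.nii.gz` content first.

```python
import gzip
import struct
import zlib


def looks_like_nifti(payload: bytes) -> bool:
    """Return True if payload is non-empty and starts with a plausible NIfTI header.

    Hypothetical helper for illustration; it inspects the header only and
    does not validate the image data that follows it.
    """
    if not payload:
        return False  # the 0-byte case described above

    # .nii.gz files start with the gzip magic bytes 1f 8b.
    if payload[:2] == b"\x1f\x8b":
        try:
            payload = gzip.decompress(payload)
        except (OSError, EOFError, zlib.error):
            return False

    if len(payload) < 348:  # a NIfTI-1 header is 348 bytes long
        return False

    # sizeof_hdr is the first int32; try both byte orders.
    for byte_order in ("<", ">"):
        (sizeof_hdr,) = struct.unpack(byte_order + "i", payload[:4])
        if sizeof_hdr == 348:  # NIfTI-1: magic at offset 344
            return payload[344:348] in (b"n+1\x00", b"ni1\x00")
        if sizeof_hdr == 540:  # NIfTI-2: magic at offset 4
            return payload[4:7] in (b"n+2", b"ni2")
    return False
```

For example, `looks_like_nifti(open("sub-01_T1w.nii.gz", "rb").read())` (the file name is a placeholder) should be true for a valid local file, while the empty payloads produced by the bug return false.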
| 57 | + |
| 58 | +See [huggingface/datasets#7815](https://github.com/huggingface/datasets/pull/7815) for the original Nifti support PR and subsequent bug reports. |
| 59 | + |
| 60 | +### The Fix |
| 61 | + |
| 62 | +Install `datasets` from the GitHub main branch (dev version 4.4.2.dev0 or later). This template is pre-configured to do this via `[tool.uv.sources]` in `pyproject.toml`. |
| 63 | + |
| 64 | +### Verification |
| 65 | + |
| 66 | +After installation, verify the version: |
| 67 | + |
| 68 | +```python |
| 69 | +import datasets |
| 70 | +print(datasets.__version__) # Should show "4.4.2.dev0" or similar dev version |
| 71 | +``` |
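If you want the check to fail loudly instead of relying on reading the printed version, a stricter variant could look like the sketch below. The threshold `(4, 4, 1)` is taken from the affected versions listed in this README, and the sketch assumes builds from the GitHub main branch report a version above it (such as `4.4.2.dev0`). The helper names are illustrative and not part of `datasets`.

```python
import re
from importlib import metadata

# Last PyPI release this README reports as affected (assumption from the text above).
LAST_BROKEN_RELEASE = (4, 4, 1)


def release_tuple(version: str) -> tuple:
    """Return the leading numeric part of a version, e.g. '4.4.2.dev0' -> (4, 4, 2)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    parts = tuple(int(part) for part in match.group(0).split("."))
    # Pad short versions such as '4.4' to three components.
    return parts + (0,) * (3 - len(parts))


def has_embed_fix(version: str) -> bool:
    """True if the version is newer than the last release reported as broken."""
    return release_tuple(version) > LAST_BROKEN_RELEASE


def check_installed_datasets() -> None:
    """Raise RuntimeError if the installed `datasets` predates the fix."""
    installed = metadata.version("datasets")
    if not has_embed_fix(installed):
        raise RuntimeError(
            f"datasets {installed} may upload empty NIfTI files; "
            "install from the GitHub main branch instead"
        )
```

Calling `check_installed_datasets()` at the start of an upload script stops the run before anything is pushed. The comparison deliberately ignores the `.dev0` suffix, so it treats `4.4.2.dev0` as newer than `4.4.1`, which matches the guidance above.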
| 72 | + |
| 73 | +### Manual Installation (if not using this template) |
| 74 | + |
| 75 | +**For uv (pyproject.toml):** |
| 76 | + |
| 77 | +```toml |
| 78 | +[project] |
| 79 | +dependencies = [ |
| 80 | + "datasets>=4.4.0", # Minimum version for Nifti support |
| 81 | + "huggingface-hub>=0.32.0", # Required for XET storage |
| 82 | + # ... other deps |
| 83 | +] |
| 84 | + |
| 85 | +[tool.uv.sources] |
| 86 | +# CRITICAL: Override datasets to use git version for Nifti.embed_storage fix |
| 87 | +datasets = { git = "https://github.com/huggingface/datasets.git" } |
| 88 | +``` |
| 89 | + |
| 90 | +**For pip/requirements.txt:** |
| 91 | + |
| 92 | +```text |
| 93 | +datasets @ git+https://github.com/huggingface/datasets.git |
| 94 | +``` |
| 95 | + |
| 96 | +**Direct uv command:** |
| 97 | + |
| 98 | +```bash |
| 99 | +uv add "datasets @ git+https://github.com/huggingface/datasets.git" |
| 100 | +``` |
| 101 | + |
| 102 | +> **Note**: The Git install is only needed until the fix ships in a stable release. Check [HuggingFace datasets releases](https://github.com/huggingface/datasets/releases) for updates.
| 103 | +
|
46 | 104 | ## Project Structure |
47 | 105 |
|
48 | 106 | ``` |
|