|
1 | | -#  |
2 | | -Test data to be used for automated testing with the nf-core pipelines |
| 1 | +# 🧬 Rare Disease Test Datasets |
3 | 2 |
|
4 | | -> ⚠️ **Do not merge your test data to `master`! Each pipeline has a dedicated branch (and a special one for modules)** |
| 3 | +This repository provides **subsampled long-read sequencing datasets** derived from HG002, designed for testing and validating the long-read rare disease analysis pipeline [nf-core/longraredisease](https://github.com/nf-core/longraredisease). |
5 | 4 |
|
6 | | -## Introduction |
| 5 | +All datasets are restricted to **chromosome 22 (first 50 Mb)** to minimise file sizes and speed up automated test runs. |
7 | 6 |
|
8 | | -nf-core is a collection of high quality Nextflow pipelines. This repository contains various files for CI and unit testing of nf-core pipelines and infrastructure. |
| 7 | +--- |
9 | 8 |
|
10 | | -The principle for nf-core test data is as small as possible, as large as necessary. Please see the [guidelines](https://nf-co.re/docs/contributing/test_data_guidelines) for more detailed information. Always ask for guidance on the [nf-core slack](https://nf-co.re/join) before adding new test data. |
| 9 | +## 📂 Repository Contents |
11 | 10 |
|
12 | | -## Documentation |
| 11 | +| Folder / File | Description | |
| 12 | +| ------------------- | ------------------------------------------------------------------------------------------ | |
| 13 | +| `ubam_file/` | Subsampled **unmapped BAM** files (uBAMs) for testing variant calling from unaligned data. | |
| 14 | +| `fastq_file/` | Subsampled **FASTQ** file generated from HG002 basecalled reads. | |
| 15 | +| `spectre/` | Example **VCF** and **BED** files for CNV detection testing with _Spectre_. | |
| 16 | +| `straglr/` | **STR test regions** (chromosome 22) for _Straglr_ validation. |
| 17 | +| `hificnv/` | **Exclude BED** regions used for chromosome 22 CNV benchmarking. | |
| 18 | +| `reference/` | Reduced **human genome reference**, containing only chromosome 22 (GRCh38). | |
| 19 | +| `samplesheet_*.csv` | Example **sample metadata** for automated pipeline test runs. | |
13 | 20 |
|
14 | | -nf-core/test-datasets comes with documentation in the `docs/` directory: |
| 21 | +--- |
15 | 22 |
|
16 | | -01. [Add a new test dataset](https://github.com/nf-core/test-datasets/blob/master/docs/ADD_NEW_DATA.md) |
17 | | -02. [Use an existing test dataset](https://github.com/nf-core/test-datasets/blob/master/docs/USE_EXISTING_DATA.md) |
| 23 | +## 🧪 Sample Overview |
18 | 24 |
|
19 | | -## Downloading test data |
| 25 | +| Column | Description | |
| 26 | +| ----------------------------------------- | ------------------------------------- | |
| 27 | +| `sample_id` | Unique identifier for the test sample | |
| 28 | +| `input_type` | Input data type (FASTQ, BAM, etc.) | |
| 29 | +| `file_path` | Direct download link to test data | |
| 30 | +| `hpo_terms` | Associated HPO phenotype terms | |
| 31 | +| `sex` | Biological sex | |
| 32 | +| `family_id`, `maternal_id`, `paternal_id` | Family metadata | |
20 | 33 |
|
21 | | -Due the large number of large files in this repository for each pipeline, we highly recommend cloning only the branches you would use. |
| 34 | +Example entry: |
| 35 | + |
| 36 | +```csv |
| 37 | +sample_id,input_type,file_path,hpo_terms,sex,family_id,maternal_id,paternal_id |
| 38 | +test,fastq,https://raw.githubusercontent.com/nourmahfel/test-datasets/longraredisease/fastq_file/hg002_subset.fastq.gz,HP:0002721;HP:0002110;HP:0500093;HP:0000717;HP:0001263;HP:0001763;HP:0003298;HP:0002857;HP:0001382,F,family_21,null,null |
| 39 | +``` |
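
A quick consistency check on a samplesheet can catch rows with missing fields before a test run. The sketch below is illustrative only: the helper name `check_samplesheet` and the shortened demo row are assumptions, not part of the repository, and the check is a naive comma split (no quoted fields), which is enough for the layout shown in the example entry.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): verify that every row
# of a samplesheet has the same number of comma-separated fields as the header.
check_samplesheet() {
    awk -F',' '
        NR == 1 { expected = NF; next }      # the header defines the column count
        NF != expected {                     # report any row that disagrees
            printf "line %d: expected %d fields, found %d\n", NR, expected, NF
            bad = 1
        }
        END { exit (bad ? 1 : 0) }           # non-zero exit if any row was bad
    ' "$1"
}

# Demo with a shortened, made-up row in the same layout as the example entry.
demo_sheet="$(mktemp)"
cat > "$demo_sheet" <<'EOF'
sample_id,input_type,file_path,hpo_terms,sex,family_id,maternal_id,paternal_id
test,fastq,hg002_subset.fastq.gz,HP:0002721;HP:0002110,F,family_21,null,null
EOF

if check_samplesheet "$demo_sheet"; then
    echo "samplesheet OK"
fi
```

The function exits non-zero when any row disagrees with the header, so it can gate a CI step directly.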
| 40 | + |
| 41 | +--- |
| 42 | + |
| 43 | +## ⚙️ Usage |
| 44 | + |
| 45 | +These datasets are intended for **automated pipeline testing**, enabling quick validation of the full _long-read rare disease analysis_ workflow — from unaligned reads through to variant calling and annotation. |
| 46 | + |
| 47 | +Example Nextflow test run: |
22 | 48 |
|
23 | 49 | ```bash |
24 | | -git clone <url> --single-branch --branch <pipeline/modules/branch_name> |
| 50 | +nextflow run nf-core/longraredisease -profile test,docker |
25 | 51 | ``` |
26 | 52 |
|
27 | | -To subsequently clone other branches[^1] |
| 53 | +The repository includes a `test.config` file containing preset paths and parameters used for CI and development validation. |
| 54 | + |
| 55 | +--- |
| 56 | + |
| 57 | +## 🧩 Data Generation Workflow |
| 58 | + |
| 59 | +The following steps describe how each dataset was created from **HG002 data**. |
| 60 | + |
| 61 | +### 1️⃣ Extract 50 Mb region from chromosome 22 |
| 62 | + |
| 63 | +A compact subset was created to minimise storage and runtime while preserving data realism: |
28 | 64 |
|
29 | 65 | ```bash |
30 | | -git remote set-branches --add origin [remote-branch] |
31 | | -git fetch |
| 66 | +samtools view -b calls.sorted.bam chr22:1-50000000 > chr22_50mb.bam |
| 67 | +samtools index chr22_50mb.bam |
32 | 68 | ``` |
33 | 69 |
|
34 | | -## Support |
| 70 | +This produced `chr22_50mb.bam`, a BAM file covering the first **50 Mb** of chromosome 22. |
| 71 | + |
| 72 | +--- |
| 73 | + |
| 74 | +### 2️⃣ Generate FASTQ file |
| 75 | + |
| 76 | +The BAM file was converted to FASTQ format to simulate basecalled reads: |
| 77 | + |
| 78 | +```bash |
| 79 | +samtools fastq chr22_50mb.bam > hg002_subset.fastq |
| 80 | +gzip hg002_subset.fastq |
| 81 | +``` |
| 82 | + |
| 83 | +Output: |
| 84 | + |
| 85 | +- `hg002_subset.fastq.gz` → FASTQ dataset for testing pipeline entry from raw reads. |
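
After conversion, it can be useful to confirm that the compressed FASTQ is intact and to count its reads. The sketch below is a minimal check, assuming unwrapped four-line FASTQ records (the layout `samtools fastq` writes); the helper name `count_fastq_reads` and the two demo reads are made up for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count reads in a gzipped FASTQ, assuming
# unwrapped 4-line records. Fails if the line count is not a multiple of 4.
count_fastq_reads() {
    gzip -cd < "$1" | awk '
        END {
            if (NR % 4 != 0) {
                print "truncated FASTQ: " NR " lines" > "/dev/stderr"
                exit 1
            }
            print NR / 4                     # one read per four lines
        }'
}

# Demo on a tiny, made-up FASTQ with two reads.
demo_fq="$(mktemp)"
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' | gzip -c > "$demo_fq"
count_fastq_reads "$demo_fq"
```

Run against `hg002_subset.fastq.gz`, the same function gives a read count that can be compared with the number of reads in the source BAM.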
| 86 | + |
| 87 | +--- |
| 88 | + |
| 89 | +### 3️⃣ Create unmapped BAM (uBAM) |
| 90 | + |
| 91 | +To test the alignment and variant calling stages from unaligned data, an **unmapped BAM** version was generated: |
| 92 | + |
| 93 | +```bash |
| 94 | +samtools view -h chr22_50mb.bam | awk '$3=="*" || /^@/' | samtools view -b -o hg002_subset.ubam |
| 95 | +``` |
| 96 | + |
| 97 | +This file retains read names, qualities, and tags but removes alignment fields (RNAME, POS, CIGAR, etc.). |
| 98 | + |
| 99 | +Output: |
| 100 | + |
| 101 | +- `hg002_subset.ubam` → unaligned BAM file suitable for pipeline tests starting from mapping. |
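
To see which records the `awk` filter in the uBAM command selects, it can be run on a toy SAM stream without samtools. The sketch below uses two made-up records, one aligned to `chr22` and one with RNAME `*`; the function names `keep_unmapped_and_header` and `toy_sam` are hypothetical. On this input only the header lines and the record whose RNAME is `*` pass through, so it is worth confirming that the resulting uBAM holds the expected number of reads.

```bash
#!/usr/bin/env bash
# Same filter as in the uBAM command: keep header lines and records with RNAME "*".
keep_unmapped_and_header() {
    awk '$3=="*" || /^@/'
}

# Toy SAM stream (made-up reads): read1 is aligned to chr22, read2 is unmapped.
toy_sam() {
    printf '@HD\tVN:1.6\tSO:coordinate\n'
    printf '@SQ\tSN:chr22\tLN:50818468\n'
    printf 'read1\t0\tchr22\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n'
    printf 'read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGA\tIIII\n'
}

# Print the names of the reads that survive the filter.
toy_sam | keep_unmapped_and_header | grep -v '^@' | cut -f1
```

If the aim is to strip alignment fields from every read rather than to select reads that are already unmapped, `samtools reset` (in recent samtools releases) or a `samtools fastq` / `samtools import` round trip may be worth evaluating. This is a suggestion, not a description of how `hg002_subset.ubam` was built.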
| 102 | + |
| 103 | +--- |
| 104 | + |
| 105 | +## 📦 Summary of Derived Outputs |
| 106 | + |
| 107 | +| File | Description | Source | |
| 108 | +| ----------------------- | ---------------------------------- | ---------------- | |
| 109 | +| `hg002_subset.fastq.gz` | Subsampled FASTQ (50 Mb region) | `chr22_50mb.bam` | |
| 110 | +| `hg002_subset.ubam` | Unmapped BAM for alignment testing | `chr22_50mb.bam` | |
| 111 | +| `reference/chr22.fasta` | Reduced genome reference | GRCh38 | |
| 112 | + |
| 113 | +--- |
35 | 114 |
|
36 | | -For further information or help, don't hesitate to get in touch on our [Slack organisation](https://nf-co.re/join/slack) (a tool for instant messaging). |
| 115 | +## 📄 License and Attribution |
37 | 116 |
|
38 | | -[^1]: From [stackoverflow](https://stackoverflow.com/a/60846265/11502856) |
| 117 | +Data derived from **HG002** (Genome in a Bottle Consortium). |
| 118 | +Please cite **GIAB** and relevant tools when reusing or redistributing these datasets. |