Commit 7bf9c04

Merge pull request #1736 from nourmahfel/longraredisease
Longraredisease
2 parents 5815d2c + 85f6a86 commit 7bf9c04

11 files changed

+777
-19
lines changed

README.md

Lines changed: 99 additions & 19 deletions
@@ -1,38 +1,118 @@
1-
# ![nfcore/test-datasets](docs/images/test-datasets_logo.png)
2-
Test data to be used for automated testing with the nf-core pipelines
1+
# 🧬 Rare Disease Test Datasets
32

4-
> ⚠️ **Do not merge your test data to `master`! Each pipeline has a dedicated branch (and a special one for modules)**
3+
This repository provides **subsampled long-read sequencing datasets** derived from HG002, designed for testing and validating the long-read rare disease analysis pipeline [nf-core/longraredisease](https://github.com/nf-core/longraredisease).
54

6-
## Introduction
5+
All datasets are restricted to **chromosome 22 (first 50 Mb)** to minimise file sizes and speed up automated test runs.
76

8-
nf-core is a collection of high quality Nextflow pipelines. This repository contains various files for CI and unit testing of nf-core pipelines and infrastructure.
7+
---
98

10-
The principle for nf-core test data is as small as possible, as large as necessary. Please see the [guidelines](https://nf-co.re/docs/contributing/test_data_guidelines) for more detailed information. Always ask for guidance on the [nf-core slack](https://nf-co.re/join) before adding new test data.
9+
## 📂 Repository Contents
1110

12-
## Documentation
11+
| Folder / File | Description |
12+
| ------------------- | ------------------------------------------------------------------------------------------ |
13+
| `ubam_file/` | Subsampled **unmapped BAM** files (uBAMs) for testing variant calling from unaligned data. |
14+
| `fastq_file/` | Subsampled **FASTQ** file generated from HG002 basecalled reads. |
15+
| `spectre/` | Example **VCF** and **BED** files for CNV detection testing with _Spectre_. |
16+
| `straglr/` | **STR test regions** (chromosome 22) for _Straglr_ validation. |
17+
| `hificnv/` | **Exclude BED** regions used for chromosome 22 CNV benchmarking. |
18+
| `reference/` | Reduced **human genome reference**, containing only chromosome 22 (GRCh38). |
19+
| `samplesheet_*.csv` | Example **sample metadata** for automated pipeline test runs. |
1320

14-
nf-core/test-datasets comes with documentation in the `docs/` directory:
21+
---
1522

16-
01. [Add a new test dataset](https://github.com/nf-core/test-datasets/blob/master/docs/ADD_NEW_DATA.md)
17-
02. [Use an existing test dataset](https://github.com/nf-core/test-datasets/blob/master/docs/USE_EXISTING_DATA.md)
23+
## 🧪 Sample Overview
1824

19-
## Downloading test data
25+
| Column | Description |
26+
| ----------------------------------------- | ------------------------------------- |
27+
| `sample_id` | Unique identifier for the test sample |
28+
| `input_type` | Input data type (FASTQ, BAM, etc.) |
29+
| `file_path` | Direct download link to test data |
30+
| `hpo_terms` | Associated HPO phenotype terms |
31+
| `sex` | Biological sex |
32+
| `family_id`, `maternal_id`, `paternal_id` | Family metadata |
2033

21-
Due to the large number of large files in this repository for each pipeline, we highly recommend cloning only the branches you would use.
34+
Example entry:
35+
36+
```
37+
sample_id,input_type,file_path,hpo_terms,sex,family_id,maternal_id,paternal_id
38+
test,fastq,https://raw.githubusercontent.com/nourmahfel/test-datasets/longraredisease/fastq_file/hg002_subset.fastq.gz,HP:0002721;HP:0002110;HP:0500093;HP:0000717;HP:0001263;HP:0001763;HP:0003298;HP:0002857;HP:0001382,F,family_21,null,null
39+
```
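A samplesheet like the example above can be sanity-checked before it is handed to a pipeline. The following is a minimal sketch, not part of the repository: it assumes the sheet is plain comma-separated text with no quoted fields, and the variable names (`sheet`, `bad_rows`, `hpo_count`) are illustrative only.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Write the example samplesheet from the README to a temporary file.
sheet="$(mktemp)"
cat > "$sheet" <<'EOF'
sample_id,input_type,file_path,hpo_terms,sex,family_id,maternal_id,paternal_id
test,fastq,https://raw.githubusercontent.com/nourmahfel/test-datasets/longraredisease/fastq_file/hg002_subset.fastq.gz,HP:0002721;HP:0002110;HP:0500093;HP:0000717;HP:0001263;HP:0001763;HP:0003298;HP:0002857;HP:0001382,F,family_21,null,null
EOF

# Count data rows whose number of fields differs from the header row.
bad_rows="$(awk -F',' 'NR==1 {n=NF; next} NF!=n {bad++} END {print bad+0}' "$sheet")"

# Count the semicolon-separated HPO terms in the first data row (column 4).
hpo_count="$(awk -F',' 'NR==2 {print split($4, terms, ";")}' "$sheet")"

echo "rows with an unexpected column count: $bad_rows"
echo "HPO terms for the first sample: $hpo_count"

rm -f "$sheet"
```

The check relies on the HPO terms being separated by semicolons rather than commas, which is what keeps `hpo_terms` a single CSV field.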
40+
41+
---
42+
43+
## ⚙️ Usage
44+
45+
These datasets are intended for **automated pipeline testing**, enabling quick validation of the full _long-read rare disease analysis_ workflow — from unaligned reads through to variant calling and annotation.
46+
47+
Example Nextflow test run:
2248

2349
```bash
24-
git clone <url> --single-branch --branch <pipeline/modules/branch_name>
50+
nextflow run nf-core/longraredisease -profile test,docker
2551
```
2652

27-
To subsequently clone other branches[^1]
53+
The repository includes a `test.config` file containing preset paths and parameters used for CI and development validation.
54+
55+
---
56+
57+
## 🧩 Data Generation Workflow
58+
59+
The following steps describe how each dataset was created from **HG002 data**.
60+
61+
### 1️⃣ Extract 50 Mb region from chromosome 22
62+
63+
A compact subset was created to minimise storage and runtime while preserving data realism:
2864

2965
```bash
30-
git remote set-branches --add origin [remote-branch]
31-
git fetch
66+
samtools view -b calls.sorted.bam chr22:1-50000000 > chr22_50mb.bam
67+
samtools index chr22_50mb.bam
3268
```
3369

34-
## Support
70+
This produced `chr22_50mb.bam`, which contains the reads overlapping the first **50 Mb** of chromosome 22.
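A region query such as `chr22:1-50000000` returns reads that overlap the interval, and `samtools view` needs an index for the input BAM to resolve it. The selection rule can be illustrated on plain text without samtools. The sketch below is an approximation on made-up records (the read names and positions are hypothetical); because the region starts at position 1, a read on chr22 overlaps it exactly when its start position is at or before 50,000,000.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy SAM-like records (tab-separated): QNAME, FLAG, RNAME, POS. Values are made up.
toy_sam="$(printf 'r1\t0\tchr22\t10500000\nr2\t0\tchr22\t50000001\nr3\t0\tchr21\t10500000\nr4\t16\tchr22\t50000000\n')"

# Keep reads on chr22 that start at or before 50,000,000.
# With a region beginning at position 1, this is equivalent to an overlap test.
kept="$(printf '%s\n' "$toy_sam" \
  | awk -F'\t' '$3=="chr22" && $4<=50000000 {print $1}' \
  | paste -sd' ' -)"

echo "reads overlapping chr22:1-50000000: $kept"
```

Here `r2` starts one base past the region end and `r3` is on another chromosome, so neither is selected.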
71+
72+
---
73+
74+
### 2️⃣ Generate FASTQ file
75+
76+
The BAM file was converted to FASTQ format to simulate basecalled reads:
77+
78+
```bash
79+
samtools fastq chr22_50mb.bam > hg002_subset.fastq
80+
gzip hg002_subset.fastq
81+
```
82+
83+
Output:
84+
85+
- `hg002_subset.fastq.gz` → FASTQ dataset for testing pipeline entry from raw reads.
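A quick way to check a gzipped FASTQ after conversion is to count its records and bases. The sketch below runs on a two-read toy file rather than the real dataset, and it assumes the four-line-per-record layout (no wrapped sequence lines); the file and variable names are illustrative.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir="$(mktemp -d)"

# Build a toy two-read FASTQ and compress it, mirroring the gzip step above.
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCCA\n+\nIIIII\n' > "$workdir/toy.fastq"
gzip "$workdir/toy.fastq"

# Each record spans four lines, so reads = lines / 4.
read_count="$(gzip -dc "$workdir/toy.fastq.gz" | awk 'END {print NR/4}')"

# The sequence is the second line of every four-line record.
base_count="$(gzip -dc "$workdir/toy.fastq.gz" | awk 'NR%4==2 {n+=length($0)} END {print n+0}')"

echo "reads: $read_count, bases: $base_count"

rm -rf "$workdir"
```

Pointing the two `gzip -dc` pipelines at `hg002_subset.fastq.gz` would give the same summary for the real file.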
86+
87+
---
88+
89+
### 3️⃣ Create unmapped BAM (uBAM)
90+
91+
To test the alignment and variant calling stages from unaligned data, an **unmapped BAM** version was generated:
92+
93+
```bash
94+
samtools view -h chr22_50mb.bam | awk '$3=="*" || /^@/' | samtools view -b -o hg002_subset.ubam
95+
```
96+
97+
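The `awk` filter passes header lines (`/^@/`) and records whose RNAME (column 3) is `*`, so every record in the output has its alignment fields (RNAME, POS, CIGAR, etc.) unset while keeping read names, qualities, and tags.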
98+
99+
Output:
100+
101+
- `hg002_subset.ubam` → unaligned BAM file suitable for pipeline tests starting from mapping.
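The behaviour of the `awk` expression used in this step can be checked on SAM text directly, without samtools. The sketch below applies the same expression to a hand-written toy SAM (two header lines, one mapped record, one unmapped record; all values are made up) and reports what survives.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy SAM: two header lines, one mapped read, one unmapped read (hypothetical values).
toy_sam="$(printf '@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr22\tLN:50818468\nmapped1\t0\tchr22\t10500000\t60\t4M\t*\t0\t0\tACGT\tIIII\nunmapped1\t4\t*\t0\t0\t*\t*\t0\t0\tGGCC\tIIII\n')"

# Same expression as in the step above: keep headers and records with RNAME "*".
filtered="$(printf '%s\n' "$toy_sam" | awk '$3=="*" || /^@/')"

# Summarise which reads and how many header lines passed the filter.
kept_reads="$(printf '%s\n' "$filtered" | awk '!/^@/ {print $1}' | paste -sd' ' -)"
header_lines="$(printf '%s\n' "$filtered" | grep -c '^@')"

echo "header lines kept: $header_lines"
echo "reads kept: $kept_reads"
```

The filter selects records rather than rewriting them, so the mapped record is dropped and the `@SQ` header line is carried through unchanged. If alignment fields need to be stripped from mapped records instead, recent samtools releases (1.17 or later, as far as I know) provide a `samtools reset` subcommand for that purpose; it is not part of the workflow described here.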
102+
103+
---
104+
105+
## 📦 Summary of Derived Outputs
106+
107+
| File | Description | Source |
108+
| ----------------------- | ---------------------------------- | ---------------- |
109+
| `hg002_subset.fastq.gz` | Subsampled FASTQ (50 Mb region) | `chr22_50mb.bam` |
110+
| `hg002_subset.ubam` | Unmapped BAM for alignment testing | `chr22_50mb.bam` |
111+
| `reference/chr22.fasta` | Reduced genome reference | GRCh38 |
112+
113+
---
35114

36-
For further information or help, don't hesitate to get in touch on our [Slack organisation](https://nf-co.re/join/slack) (a tool for instant messaging).
115+
## 📄 License and Attribution
37116

38-
[^1]: From [stackoverflow](https://stackoverflow.com/a/60846265/11502856)
117+
Data derived from **HG002** (Genome in a Bottle Consortium).
118+
Please cite **GIAB** and relevant tools when reusing or redistributing these datasets.

fastq_file/hg002_subset.fastq.gz

33.1 MB
Binary file not shown.
