Commit a1c66a3

Merge pull request #1603 from nf-core/bam_entry_point
Add bam input pathway
2 parents 072dbcc + 8fcb935 commit a1c66a3

45 files changed, +1457 -391 lines changed

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -14,6 +14,7 @@ Special thanks to the following for their contributions to the release:
 ### Enhancements & fixes

 - [PR #1597](https://github.com/nf-core/rnaseq/pull/1597) - Bump version after release 3.20.0
+- [PR #1603](https://github.com/nf-core/rnaseq/pull/1603) - Add bam input pathway

 ## [[3.20.0](https://github.com/nf-core/rnaseq/releases/tag/3.20.0)] - 2025-08-18

README.md

Lines changed: 3 additions & 1 deletion
@@ -20,7 +20,7 @@

 ## Introduction

-**nf-core/rnaseq** is a bioinformatics pipeline that can be used to analyse RNA sequencing data obtained from organisms with a reference genome and annotation. It takes a samplesheet and FASTQ files as input, performs quality control (QC), trimming and (pseudo-)alignment, and produces a gene expression matrix and extensive QC report.
+**nf-core/rnaseq** is a bioinformatics pipeline that can be used to analyse RNA sequencing data obtained from organisms with a reference genome and annotation. It takes a samplesheet with FASTQ files or pre-aligned BAM files as input, performs quality control (QC), trimming and (pseudo-)alignment, and produces a gene expression matrix and extensive QC report.

 ![nf-core/rnaseq metro map](docs/images/nf-core-rnaseq_metro_map_grey_animated.svg)

@@ -76,6 +76,8 @@ CONTROL_REP1,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz,a

 Each row represents a fastq file (single-end) or a pair of fastq files (paired end). Rows with the same sample identifier are considered technical replicates and merged automatically. The strandedness refers to the library preparation and will be automatically inferred if set to `auto`.

+The pipeline supports a two-step reprocessing workflow using BAM files from previous runs. Run initially with `--save_align_intermeds` to generate a samplesheet with BAM paths, then reprocess using `--skip_alignment` for efficient downstream analysis without repeating expensive alignment steps. This feature is designed specifically for pipeline-generated BAMs.
+
 > [!WARNING]
 > Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those provided by the `-c` Nextflow option can be used to provide any configuration _**except for parameters**_; see [docs](https://nf-co.re/docs/usage/getting_started/configuration#custom-configuration-files).

assets/schema_input.json

Lines changed: 21 additions & 0 deletions
@@ -32,6 +32,27 @@
                 "errorMessage": "Strandedness must be provided and be one of 'auto', 'forward', 'reverse' or 'unstranded'",
                 "enum": ["forward", "reverse", "unstranded", "auto"],
                 "meta": ["strandedness"]
+            },
+            "genome_bam": {
+                "type": "string",
+                "format": "file-path",
+                "exists": true,
+                "pattern": "^([\\S\\s]*\\/)?[^\\s\\/]+\\.(bam|BAM)$",
+                "errorMessage": "Genome BAM file cannot contain spaces and must have extension '.bam'"
+            },
+            "transcriptome_bam": {
+                "type": "string",
+                "format": "file-path",
+                "exists": true,
+                "pattern": "^([\\S\\s]*\\/)?[^\\s\\/]+\\.(bam|BAM)$",
+                "errorMessage": "Transcriptome BAM file cannot contain spaces and must have extension '.bam'"
+            },
+            "percent_mapped": {
+                "type": "number",
+                "minimum": 0,
+                "maximum": 100,
+                "errorMessage": "Percent mapped must be a number between 0 and 100",
+                "meta": "percent_mapped"
             }
         },
         "required": ["sample", "fastq_1", "strandedness"]

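As a rough illustration of what the new `genome_bam`, `transcriptome_bam` and `percent_mapped` properties accept, the constraints can be mirrored in a few lines of Python. This is a minimal sketch, not the pipeline's validator: the function names `bam_path_ok` and `percent_mapped_ok` are hypothetical, and only the `pattern`, `minimum` and `maximum` keywords are covered (the `exists` and `format` checks are not reproduced).

```python
import re

# Same regular expression as the schema "pattern" for genome_bam and
# transcriptome_bam, with the JSON string escaping removed.
BAM_PATTERN = re.compile(r"^([\S\s]*/)?[^\s/]+\.(bam|BAM)$")


def bam_path_ok(path: str) -> bool:
    """Return True if the whole path satisfies the schema's BAM pattern."""
    return BAM_PATTERN.fullmatch(path) is not None


def percent_mapped_ok(value: float) -> bool:
    """Return True if the value lies in the schema's inclusive 0-100 range."""
    return 0 <= value <= 100


if __name__ == "__main__":
    for candidate in ("results/star_salmon/SAMPLE1.sorted.bam", "my sample.bam", "S1.Bam"):
        print(candidate, bam_path_ok(candidate))
```

Under this sketch the pattern restricts whitespace only in the file name component, so a path such as `dir with space/S1.bam` still matches, and only the exact extensions `.bam` and `.BAM` are accepted.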
docs/output.md

Lines changed: 12 additions & 6 deletions
@@ -10,6 +10,10 @@ nextflow run nf-core/rnaseq -profile test_full,<docker/singularity/institute>

 The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

+:::tip
+Many of the BAM files produced by this pipeline can be reused as input for future runs. This is particularly useful for reprocessing data or running downstream analysis steps without repeating computationally expensive alignment. See the [usage documentation](https://nf-co.re/rnaseq/usage#using-bam-files-as-input) for details on using BAM files as input.
+:::
+
 ## Pipeline overview

 The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:
@@ -213,8 +217,8 @@ When `--remove_ribo_rna` is specified, the pipeline uses [SortMeRNA](https://git
 <summary>Output files</summary>

 - `star_salmon/`
-  - `*.Aligned.out.bam`: If `--save_align_intermeds` is specified the original BAM file containing read alignments to the reference genome will be placed in this directory.
-  - `*.Aligned.toTranscriptome.out.bam`: If `--save_align_intermeds` is specified the original BAM file containing read alignments to the transcriptome will be placed in this directory.
+  - `*.Aligned.out.bam`: If `--save_align_intermeds` is specified the original BAM file containing read alignments to the reference genome will be placed in this directory. These files can be reused as `genome_bam` input in future pipeline runs.
+  - `*.Aligned.toTranscriptome.out.bam`: If `--save_align_intermeds` is specified the original BAM file containing read alignments to the transcriptome will be placed in this directory. These files can be reused as `transcriptome_bam` input in future pipeline runs.
   - `salmon.merged.gene_counts.tsv`: Matrix of gene-level raw counts across all samples.
   - `salmon.merged.gene_tpm.tsv`: Matrix of gene-level TPM values across all samples.
   - `salmon.merged.gene.SummarizedExperiment.rds`: RDS object that can be loaded in R that contains a [SummarizedExperiment](https://bioconductor.org/packages/release/bioc/html/SummarizedExperiment.html) container with the abundance TPM (`tpm`), estimated counts (`counts`) and gene length (`length`), estimated library size-scaled counts (`counts_scaled`), estimated length-scaled counts (`counts_length_scaled`) in the assays slot for genes.
@@ -276,8 +280,8 @@ The STAR section of the MultiQC report shows a bar plot with alignment rates: go
   - `rsem.merged.transcript_tpm.tsv`: Matrix of isoform-level TPM values across all samples.
   - `*.genes.results`: RSEM gene-level quantification results for each sample.
   - `*.isoforms.results`: RSEM isoform-level quantification results for each sample.
-  - `*.STAR.genome.bam`: If `--save_align_intermeds` is specified the original BAM file containing read alignments to the reference genome will be placed in this directory.
-  - `*.transcript.bam`: If `--save_align_intermeds` is specified the original BAM file containing read alignments to the transcriptome will be placed in this directory.
+  - `*.STAR.genome.bam`: If `--save_align_intermeds` is specified the original BAM file containing read alignments to the reference genome will be placed in this directory. These files can be reused as `genome_bam` input in future pipeline runs.
+  - `*.transcript.bam`: If `--save_align_intermeds` is specified the original BAM file containing read alignments to the transcriptome will be placed in this directory. These files can be reused as `transcriptome_bam` input in future pipeline runs.
 - `star_rsem/<SAMPLE>.stat/`
   - `*.cnt`, `*.model`, `*.theta`: RSEM counts and statistics for each sample.
 - `star_rsem/log/`
@@ -299,7 +303,7 @@ You can choose to align and quantify your data with RSEM by providing the `--ali
 <summary>Output files</summary>

 - `hisat2/`
-  - `<SAMPLE>.bam`: If `--save_align_intermeds` is specified the original BAM file containing read alignments to the reference genome will be placed in this directory.
+  - `<SAMPLE>.bam`: If `--save_align_intermeds` is specified the original BAM file containing read alignments to the reference genome will be placed in this directory. These files can be reused as `genome_bam` input in future pipeline runs.
 - `hisat2/log/`
   - `*.log`: HISAT2 alignment report containing the mapping results summary.
 - `hisat2/unmapped/`
@@ -323,7 +327,7 @@ The pipeline has been written in a way where all the files generated downstream
 <summary>Output files</summary>

 - `<ALIGNER>/`
-  - `<SAMPLE>.sorted.bam`: If `--save_align_intermeds` is specified the original coordinate sorted BAM file containing read alignments will be placed in this directory.
+  - `<SAMPLE>.sorted.bam`: If `--save_align_intermeds` is specified the original coordinate sorted BAM file containing read alignments will be placed in this directory. These files can be reused as `genome_bam` input in future pipeline runs.
   - `<SAMPLE>.sorted.bam.bai`: If `--save_align_intermeds` is specified the BAI index file for the original coordinate sorted BAM file will be placed in this directory.
   - `<SAMPLE>.sorted.bam.csi`: If `--save_align_intermeds --bam_csi_index` is specified the CSI index file for the original coordinate sorted BAM file will be placed in this directory.
 - `<ALIGNER>/samtools_stats/`
@@ -864,6 +868,8 @@ A number of genome-specific files are generated by the pipeline because they are
   - Reports generated by the pipeline: `pipeline_report.html`, `pipeline_report.txt` and `software_versions.yml`. The `pipeline_report*` files will only be present if the `--email` / `--email_on_fail` parameter's are used when running the pipeline.
   - Reformatted samplesheet files used as input to the pipeline: `samplesheet.valid.csv`.
   - Parameters used by the pipeline run: `params.json`.
+- `samplesheets/`
+  - `samplesheet_with_bams.csv`: **Auto-generated complete samplesheet** (only created when using `--save_align_intermeds`) containing all samples with BAM file paths. For samples processed from FASTQ, includes paths to newly generated BAMs; for samples that were BAM input, preserves the original input paths. This comprehensive samplesheet can be used directly for future pipeline runs, enabling efficient reprocessing without re-alignment.

 </details>
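The "can be reused as ..." notes added in the hunks above amount to a mapping from output file suffix to samplesheet column. A minimal sketch of that mapping follows, assuming only the suffixes named in these output descriptions; the `SUFFIX_TO_COLUMN` table and the `samplesheet_column` helper are hypothetical names for illustration and are not part of the pipeline.

```python
from typing import Optional

# Suffixes taken from the output descriptions in this diff; the column each one
# maps to follows the "can be reused as ..." notes. Illustrative only.
SUFFIX_TO_COLUMN = {
    ".Aligned.toTranscriptome.out.bam": "transcriptome_bam",
    ".transcript.bam": "transcriptome_bam",
    ".Aligned.out.bam": "genome_bam",
    ".STAR.genome.bam": "genome_bam",
    ".sorted.bam": "genome_bam",
}


def samplesheet_column(filename: str) -> Optional[str]:
    """Return the samplesheet column for a BAM file name, or None if unknown."""
    for suffix, column in SUFFIX_TO_COLUMN.items():
        if filename.endswith(suffix):
            return column
    return None


if __name__ == "__main__":
    print(samplesheet_column("SAMPLE1.Aligned.toTranscriptome.out.bam"))
    print(samplesheet_column("SAMPLE1.markdup.sorted.bam"))
```

The HISAT2 `<SAMPLE>.bam` output has no distinguishing suffix, so this sketch returns `None` for it rather than guessing.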

docs/usage.md

Lines changed: 64 additions & 6 deletions
@@ -92,17 +92,75 @@ TREATMENT_REP3,AEG588A6_S6_L003_R1_001.fastq.gz,,reverse
 TREATMENT_REP3,AEG588A6_S6_L004_R1_001.fastq.gz,,reverse
 ```

-| Column | Description |
-| -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `sample` | Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. Spaces in sample names are automatically converted to underscores (`_`). |
-| `fastq_1` | Full path to FastQ file for Illumina short reads 1. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz". |
-| `fastq_2` | Full path to FastQ file for Illumina short reads 2. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz". |
-| `strandedness` | Sample strand-specificity. Must be one of `unstranded`, `forward`, `reverse` or `auto`. |
+| Column | Description |
+| ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `sample` | Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. Spaces in sample names are automatically converted to underscores (`_`). |
+| `fastq_1` | Full path to FastQ file for Illumina short reads 1. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz". |
+| `fastq_2` | Full path to FastQ file for Illumina short reads 2. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz". |
+| `strandedness` | Sample strand-specificity. Must be one of `unstranded`, `forward`, `reverse` or `auto`. |
+| `genome_bam` | **Optional**. Full path to genome-aligned BAM file. Typically from previous pipeline runs (see [output documentation](https://nf-co.re/rnaseq/output#star-salmon-and-kallisto) or [STAR/RSEM](https://nf-co.re/rnaseq/output#star-via-rsem)). |
+| `transcriptome_bam` | **Optional**. Full path to transcriptome-aligned BAM file. Typically from previous pipeline runs (see [output documentation](https://nf-co.re/rnaseq/output#star-salmon-and-kallisto) or [STAR/RSEM](https://nf-co.re/rnaseq/output#star-via-rsem)). |
+| `percent_mapped` | **Optional**. Percentage of reads that mapped during alignment (0-100). Useful for quality assessment and filtering. |

 An [example samplesheet](../assets/samplesheet.csv) has been provided with the pipeline.

 > **NB:** The `group` and `replicate` columns were replaced with a single `sample` column as of v3.1 of the pipeline. The `sample` column is essentially a concatenation of the `group` and `replicate` columns, however it now also offers more flexibility in instances where replicate information is not required e.g. when sequencing clinical samples. If all values of `sample` have the same number of underscores, fields defined by these underscore-separated names may be used in the PCA plots produced by the pipeline, to regain the ability to represent different groupings.

+### BAM input for reprocessing workflow
+
+The pipeline supports a **two-step workflow** for efficient reprocessing without expensive alignment steps. This feature is designed specifically for re-running with BAM files generated by previous runs of this same pipeline.
+
+#### Step 1: Initial run with BAM generation
+
+Run the pipeline normally, adding `--save_align_intermeds` to publish BAM files and generate a reusable samplesheet:
+
+```bash
+nextflow run nf-core/rnaseq \
+    --input samplesheet.csv \
+    --save_align_intermeds \
+    --outdir results_initial \
+    -profile docker
+```
+
+This creates `samplesheets/samplesheet_with_bams.csv` containing paths to the generated BAM files.
+
+#### Step 2: Reprocessing run with BAM input
+
+Use the auto-generated samplesheet to reprocess data, skipping alignment:
+
+```bash
+nextflow run nf-core/rnaseq \
+    --input samplesheets/samplesheet_with_bams.csv \
+    --skip_alignment \
+    --outdir results_reprocessed \
+    -profile docker
+```
+
+The pipeline will skip alignment and indexing steps, putting the BAM files through post-processing and quantification only.
+
+#### Example of generated samplesheet
+
+The `samplesheet_with_bams.csv` will look like:
+
+```csv
+sample,fastq_1,fastq_2,strandedness,genome_bam,percent_mapped,transcriptome_bam
+SAMPLE1,/path/sample1_R1.fastq.gz,/path/sample1_R2.fastq.gz,forward,results/star_salmon/SAMPLE1.markdup.sorted.bam,85.2,results/star_salmon/SAMPLE1.Aligned.toTranscriptome.out.bam
+SAMPLE2,/path/sample2_R1.fastq.gz,,reverse,results/star_salmon/SAMPLE2.sorted.bam,92.1,results/star_salmon/SAMPLE2.Aligned.toTranscriptome.out.bam
+```
+
+#### Important limitations
+
+> **⚠️ Warning**: This feature is designed specifically for BAM files generated by this pipeline. Using arbitrary BAM files from other sources is **not officially supported** and will likely only work via the two-step workflow described above. Users attempting to use other BAMs do so at their own risk.
+
+**Key technical details:**
+
+- The pipeline automatically indexes provided BAM files
+- You can provide just `genome_bam`, just `transcriptome_bam`, or both
+- Mixed samplesheets (some samples with FASTQ, others with BAM) are supported
+- For BAM file locations from pipeline outputs, see the [output documentation](https://nf-co.re/rnaseq/output)
+
+This workflow is ideal for tweaking downstream processing steps (quantification methods, QC parameters, differential expression analysis) without repeating time-consuming alignment.
+
 ## FASTQ sampling

 If you would like to reduce the number of reads used in the analysis, for example to test pipeline operation with limited resource usage, you can make use of the FASTP option for trimming (see below). FASTP has an option to take the first `n` reads of input FASTQ file(s), so this can be used to reduce the reads passed to subsequent steps. For example, to pass only the first 10,000 reads for trimming you would set input paramters like:
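Because the BAM input section added in this hunk allows `genome_bam`, `transcriptome_bam`, both, or neither per sample, it can help to inspect a samplesheet before the reprocessing run. The following is a minimal sketch using only the Python standard library and the example samplesheet from this hunk; `bam_columns_by_sample` is a hypothetical helper, not something the pipeline provides, and it says nothing about how the pipeline itself treats each row.

```python
import csv
import io

BAM_COLUMNS = ("genome_bam", "transcriptome_bam")

# Example content copied from the generated-samplesheet example in docs/usage.md.
EXAMPLE = """\
sample,fastq_1,fastq_2,strandedness,genome_bam,percent_mapped,transcriptome_bam
SAMPLE1,/path/sample1_R1.fastq.gz,/path/sample1_R2.fastq.gz,forward,results/star_salmon/SAMPLE1.markdup.sorted.bam,85.2,results/star_salmon/SAMPLE1.Aligned.toTranscriptome.out.bam
SAMPLE2,/path/sample2_R1.fastq.gz,,reverse,results/star_salmon/SAMPLE2.sorted.bam,92.1,results/star_salmon/SAMPLE2.Aligned.toTranscriptome.out.bam
"""


def bam_columns_by_sample(handle):
    """Map each sample name to the BAM columns that are filled in for it."""
    summary = {}
    for row in csv.DictReader(handle):
        # A column counts as present only if it exists and is non-empty.
        present = [col for col in BAM_COLUMNS if (row.get(col) or "").strip()]
        seen = summary.setdefault(row["sample"], [])
        seen.extend(col for col in present if col not in seen)
    return summary


if __name__ == "__main__":
    for sample, columns in bam_columns_by_sample(io.StringIO(EXAMPLE)).items():
        print(sample, columns or "FASTQ only")
```

Rows that share a sample name are merged in the summary, and a sample with no BAM columns filled in is reported with an empty list.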

modules.json

Lines changed: 1 addition & 1 deletion
@@ -108,7 +108,7 @@
                     },
                     "multiqc": {
                         "branch": "master",
-                        "git_sha": "c9a31c472ef2d86802eb44f27322955849859361",
+                        "git_sha": "e10b76ca0c66213581bec2833e30d31f239dec0b",
                         "installed_by": ["modules"]
                     },
                     "picard/markduplicates": {

modules/local/multiqc_custom_biotype/nextflow.config

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-if (!params.skip_alignment && !params.skip_qc) {
+if (!params.skip_qc) {
     if (!params.skip_biotype_qc && params.featurecounts_group_type) {
         process {
             withName: 'MULTIQC_CUSTOM_BIOTYPE' {
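The effect of dropping `!params.skip_alignment` from the outer guard can be checked by enumerating the two flags. This is a small sketch of the boolean logic only; `guard_before` and `guard_after` are hypothetical names, and the inner `skip_biotype_qc` / `featurecounts_group_type` condition, which is unchanged by this hunk, is not modelled.

```python
from itertools import product


def guard_before(skip_alignment: bool, skip_qc: bool) -> bool:
    """Outer condition before this commit: !skip_alignment && !skip_qc."""
    return (not skip_alignment) and (not skip_qc)


def guard_after(skip_alignment: bool, skip_qc: bool) -> bool:
    """Outer condition after this commit: !skip_qc."""
    return not skip_qc


# (skip_alignment, skip_qc) combinations whose outer-guard result changed.
changed = [
    (skip_alignment, skip_qc)
    for skip_alignment, skip_qc in product([False, True], repeat=2)
    if guard_before(skip_alignment, skip_qc) != guard_after(skip_alignment, skip_qc)
]

print(changed)  # [(True, False)]
```

The only combination that changes is `--skip_alignment` without `--skip_qc`: the outer guard now passes in that case, which is the combination used by the BAM reprocessing run.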

modules/nf-core/bracken/bracken/nextflow.config

Lines changed: 2 additions & 2 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

modules/nf-core/dupradar/nextflow.config

Lines changed: 1 addition & 1 deletion
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

modules/nf-core/kraken2/kraken2/nextflow.config

Lines changed: 2 additions & 2 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.
