Commit 00a0a15

Merge pull request #1604 from nf-core/rsem_bam_input
Enable BAM input for RSEM
2 parents a1c66a3 + 91d7d26 commit 00a0a15

29 files changed: +1680 -2586 lines changed

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -15,6 +15,7 @@ Special thanks to the following for their contributions to the release:

 - [PR #1597](https://github.com/nf-core/rnaseq/pull/1597) - Bump version after release 3.20.0
 - [PR #1603](https://github.com/nf-core/rnaseq/pull/1603) - Add bam input pathway
+- [PR #1604](https://github.com/nf-core/rnaseq/pull/1604) - Enable BAM input for RSEM

 ## [[3.20.0](https://github.com/nf-core/rnaseq/releases/tag/3.20.0)] - 2025-08-18

conf/test.config

Lines changed: 0 additions & 1 deletion
@@ -35,7 +35,6 @@ params {
 bbsplit_fasta_list = 'https://raw.githubusercontent.com/nf-core/test-datasets/626c8fab639062eade4b10747e919341cbf9b41a/reference/bbsplit_fasta_list.txt'
 hisat2_index = 'https://raw.githubusercontent.com/nf-core/test-datasets/626c8fab639062eade4b10747e919341cbf9b41a/reference/hisat2.tar.gz'
 salmon_index = 'https://raw.githubusercontent.com/nf-core/test-datasets/626c8fab639062eade4b10747e919341cbf9b41a/reference/salmon.tar.gz'
-rsem_index = 'https://raw.githubusercontent.com/nf-core/test-datasets/626c8fab639062eade4b10747e919341cbf9b41a/reference/rsem.tar.gz'

 // Other parameters
 skip_bbsplit = false

docs/output.md

Lines changed: 5 additions & 5 deletions
@@ -11,7 +11,7 @@ nextflow run nf-core/rnaseq -profile test_full,<docker/singularity/institute>
 The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

 :::tip
-Many of the BAM files produced by this pipeline can be reused as input for future runs. This is particularly useful for reprocessing data or running downstream analysis steps without repeating computationally expensive alignment. See the [usage documentation](https://nf-co.re/rnaseq/usage#using-bam-files-as-input) for details on using BAM files as input.
+Many of the BAM files produced by this pipeline can be reused as input for future runs with `--skip_alignment`. This is particularly useful for reprocessing data or running downstream analysis steps without repeating computationally expensive alignment. See the [usage documentation](https://nf-co.re/rnaseq/usage#bam-input-for-reprocessing-workflow) for details on using BAM files as input.
 :::

 ## Pipeline overview
@@ -280,16 +280,16 @@ The STAR section of the MultiQC report shows a bar plot with alignment rates: go
 - `rsem.merged.transcript_tpm.tsv`: Matrix of isoform-level TPM values across all samples.
 - `*.genes.results`: RSEM gene-level quantification results for each sample.
 - `*.isoforms.results`: RSEM isoform-level quantification results for each sample.
-- `*.STAR.genome.bam`: If `--save_align_intermeds` is specified the original BAM file containing read alignments to the reference genome will be placed in this directory. These files can be reused as `genome_bam` input in future pipeline runs.
-- `*.transcript.bam`: If `--save_align_intermeds` is specified the original BAM file containing read alignments to the transcriptome will be placed in this directory. These files can be reused as `transcriptome_bam` input in future pipeline runs.
+- `*.STAR.genome.bam`: If `--save_align_intermeds` is specified the BAM file from STAR alignment containing read alignments to the reference genome will be placed in this directory. These files can be reused as `genome_bam` input in future pipeline runs.
+- `*.transcript.bam`: If `--save_align_intermeds` is specified the BAM file from STAR alignment containing read alignments to the transcriptome will be placed in this directory. These files can be reused as `transcriptome_bam` input in future pipeline runs.
 - `star_rsem/<SAMPLE>.stat/`
 - `*.cnt`, `*.model`, `*.theta`: RSEM counts and statistics for each sample.
 - `star_rsem/log/`
 - `*.log`: STAR alignment report containing the mapping results summary.

 </details>

-[RSEM](https://github.com/deweylab/RSEM) is a software package for estimating gene and isoform expression levels from RNA-seq data. It has been widely touted as one of the most accurate quantification tools for RNA-seq analysis. RSEM wraps other popular tools to map the reads to the genome (i.e. STAR, Bowtie2, HISAT2; STAR is used in this pipeline) which are then subsequently filtered relative to a transcriptome before quantifying at the gene- and isoform-level. Other advantages of using RSEM are that it performs both the alignment and quantification in a single package and its ability to effectively use ambiguously-mapping reads.
+[RSEM](https://github.com/deweylab/RSEM) is a software package for estimating gene and isoform expression levels from RNA-seq data. It has been widely touted as one of the most accurate quantification tools for RNA-seq analysis. When using `--aligner star_rsem`, the pipeline first runs STAR alignment with RSEM-compatible parameters to generate genome and transcriptome BAM files, then RSEM quantifies expression using these pre-aligned BAMs via the `--alignments` mode. This approach ensures optimal compatibility while maintaining RSEM's ability to effectively use ambiguously-mapping reads.

 You can choose to align and quantify your data with RSEM by providing the `--aligner star_rsem` parameter.

@@ -869,7 +869,7 @@ A number of genome-specific files are generated by the pipeline because they are
 - Reformatted samplesheet files used as input to the pipeline: `samplesheet.valid.csv`.
 - Parameters used by the pipeline run: `params.json`.
 - `samplesheets/`
-- `samplesheet_with_bams.csv`: **Auto-generated complete samplesheet** (only created when using `--save_align_intermeds`) containing all samples with BAM file paths. For samples processed from FASTQ, includes paths to newly generated BAMs; for samples that were BAM input, preserves the original input paths. This comprehensive samplesheet can be used directly for future pipeline runs, enabling efficient reprocessing without re-alignment.
+- `samplesheet_with_bams.csv`: **Auto-generated samplesheet for BAM reprocessing** (only created when using `--save_align_intermeds`) containing all samples with BAM file paths. For samples processed from FASTQ, includes paths to newly generated BAMs; for samples that were BAM input, preserves the original input paths. This samplesheet can be used directly for future pipeline runs with `--skip_alignment`, enabling efficient reprocessing without re-alignment.

 </details>
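
Reviewer note: the `samplesheet_with_bams.csv` entry above describes a samplesheet whose rows carry BAM paths next to the FASTQ paths. A minimal sketch of inspecting such a file before a `--skip_alignment` run is given below. The column names `sample`, `fastq_1`, `fastq_2` and `strandedness` are assumed, `genome_bam` and `transcriptome_bam` are taken from the field names in the docs, and every path and the helper name `samples_with_bams` are hypothetical.

```python
import csv
import io

# Hypothetical samplesheet content; the header and all paths are
# illustrative assumptions, not copied from a real pipeline run.
EXAMPLE = """\
sample,fastq_1,fastq_2,strandedness,genome_bam,transcriptome_bam
SAMPLE1,/path/s1_R1.fastq.gz,/path/s1_R2.fastq.gz,forward,/path/s1.genome.bam,
SAMPLE2,/path/s2_R1.fastq.gz,,reverse,,
"""

BAM_COLUMNS = ("genome_bam", "transcriptome_bam")


def samples_with_bams(handle):
    """Map each sample name to the BAM columns that are filled in for it."""
    found = {}
    for row in csv.DictReader(handle):
        present = [col for col in BAM_COLUMNS if (row.get(col) or "").strip()]
        found[row["sample"]] = present
    return found


bam_columns = samples_with_bams(io.StringIO(EXAMPLE))
for sample, cols in bam_columns.items():
    # Samples with an empty list have no BAM to reprocess.
    print(sample, cols if cols else "no BAM provided")
```

A check like this only reports which rows could be reprocessed from BAM; it does not validate that the files exist or were produced by this pipeline.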

docs/usage.md

Lines changed: 8 additions & 4 deletions
@@ -136,7 +136,7 @@ nextflow run nf-core/rnaseq \
 -profile docker
 ```

-The pipeline will skip alignment and indexing steps, putting the BAM files through post-processing and quantification only.
+The `--skip_alignment` flag tells the pipeline to skip alignment, and in this situation it will use any provided BAM files instead of performing alignment, putting them through post-processing and quantification only.

 #### Example of generated samplesheet

@@ -152,11 +152,15 @@ SAMPLE2,/path/sample2_R1.fastq.gz,,reverse,results/star_salmon/SAMPLE2.sorted.ba

 > **⚠️ Warning**: This feature is designed specifically for BAM files generated by this pipeline. Using arbitrary BAM files from other sources is **not officially supported** and will likely only work via the two-step workflow described above. Users attempting to use other BAMs do so at their own risk.

+> **⚠️ Warning**: You cannot mix quantifier types between BAM generation and reprocessing runs. BAM files generated with `--aligner star_salmon` must be reprocessed with `--aligner star_salmon`. Similarly, BAM files from `--aligner star_rsem` must be reprocessed with `--aligner star_rsem`. Mixing quantifier types will likely produce incorrect results due to incompatible alignment parameters.
+
 **Key technical details:**

+- BAM files are only used when `--skip_alignment` is specified
 - The pipeline automatically indexes provided BAM files
 - You can provide just `genome_bam`, just `transcriptome_bam`, or both
-- Mixed samplesheets (some samples with FASTQ, others with BAM) are supported
+- Mixed samplesheets are supported, but samples with BAM files require `--skip_alignment`
+- Without `--skip_alignment`, the pipeline will perform alignment even if BAM files are provided
 - For BAM file locations from pipeline outputs, see the [output documentation](https://nf-co.re/rnaseq/output)

 This workflow is ideal for tweaking downstream processing steps (quantification methods, QC parameters, differential expression analysis) without repeating time-consuming alignment.
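
Reviewer note: the rules added in this hunk can be restated as a small decision function. The sketch below is not the pipeline's Nextflow logic; `Sample`, `plan_sample` and `check_aligner` are hypothetical names, and the `fastq_only` outcome is an assumption because the text does not say what happens to FASTQ-only samples under `--skip_alignment`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    """One samplesheet row, reduced to the fields the rules mention."""
    name: str
    genome_bam: Optional[str] = None
    transcriptome_bam: Optional[str] = None


def plan_sample(sample: Sample, skip_alignment: bool) -> str:
    """Return which path a sample takes under the documented rules."""
    has_bam = bool(sample.genome_bam or sample.transcriptome_bam)
    if not skip_alignment:
        # Without --skip_alignment, alignment runs even if BAMs are provided.
        return "align"
    if has_bam:
        # BAM files are only used when --skip_alignment is specified.
        return "use_bam"
    # Assumption: not covered by the text above.
    return "fastq_only"


def check_aligner(generated_with: str, reprocess_with: str) -> None:
    """Reject mixing quantifier types between generation and reprocessing."""
    if generated_with != reprocess_with:
        raise ValueError(
            f"BAMs from --aligner {generated_with} must be reprocessed "
            f"with --aligner {generated_with}, not {reprocess_with}"
        )


print(plan_sample(Sample("SAMPLE1", genome_bam="s1.bam"), skip_alignment=True))
print(plan_sample(Sample("SAMPLE1", genome_bam="s1.bam"), skip_alignment=False))
```

Written this way, the surprising case is easy to see: a samplesheet full of BAM paths still triggers alignment unless `--skip_alignment` is passed.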
@@ -181,7 +185,7 @@ If you would like to reduce the number of reads used in the analysis, for exampl
 The `--aligner hisat2` option is not currently supported using ARM architecture ('-profile arm')
 :::

-By default, the pipeline uses [STAR](https://github.com/alexdobin/STAR) (i.e. `--aligner star_salmon`) to map the raw FastQ reads to the reference genome, project the alignments onto the transcriptome and to perform the downstream BAM-level quantification with [Salmon](https://salmon.readthedocs.io/en/latest/salmon.html). STAR is fast but requires a lot of memory to run, typically around 38GB for the Human GRCh37 reference genome. Since the [RSEM](https://github.com/deweylab/RSEM) (i.e. `--aligner star_rsem`) workflow in the pipeline also uses STAR you should use the [HISAT2](https://ccb.jhu.edu/software/hisat2/index.shtml) aligner (i.e. `--aligner hisat2`) if you have memory limitations.
+By default, the pipeline uses [STAR](https://github.com/alexdobin/STAR) (i.e. `--aligner star_salmon`) to map the raw FastQ reads to the reference genome, project the alignments onto the transcriptome and to perform the downstream BAM-level quantification with [Salmon](https://salmon.readthedocs.io/en/latest/salmon.html). STAR is fast but requires a lot of memory to run, typically around 38GB for the Human GRCh37 reference genome. Both `--aligner star_salmon` and `--aligner star_rsem` use STAR for alignment, so you should use the [HISAT2](https://ccb.jhu.edu/software/hisat2/index.shtml) aligner (i.e. `--aligner hisat2`) if you have memory limitations.

 You also have the option to pseudoalign and quantify your data directly with [Salmon](https://salmon.readthedocs.io/en/latest/salmon.html) or [Kallisto](https://pachterlab.github.io/kallisto/) by specifying `salmon` or `kallisto` to the `--pseudo_aligner` parameter. The selected pseudoaligner will then be run in addition to the standard alignment workflow defined by `--aligner`, mainly because it allows you to obtain QC metrics with respect to the genomic alignments. However, you can provide the `--skip_alignment` parameter if you would like to run Salmon or Kallisto in isolation. By default, the pipeline will use the genome fasta and gtf file to generate the transcripts fasta file, and then to build the Salmon index. You can override these parameters using the `--transcript_fasta` and `--salmon_index` parameters, respectively.

@@ -313,7 +317,7 @@ Notes:
 - If `--gff` is provided as input then this will be converted to a GTF file, or the latter will be used if both are provided.
 - If `--gene_bed` is not provided then it will be generated from the GTF file.
 - If `--additional_fasta` is provided then the features in this file (e.g. ERCC spike-ins) will be automatically concatenated onto both the reference FASTA file as well as the GTF annotation before building the appropriate indices.
-- When using `--aligner star_rsem`, both the STAR and RSEM indices should be present in the path specified by `--rsem_index` (see [#568](https://github.com/nf-core/rnaseq/issues/568)).
+- When using `--aligner star_rsem`, the pipeline will build separate STAR and RSEM indices. STAR performs alignment with RSEM-compatible parameters, then RSEM quantifies from the resulting BAM files using `--alignments` mode.
 - If the `--skip_alignment` option is used along with `--transcript_fasta`, the pipeline can technically run without providing the genomic FASTA (`--fasta`). However, this approach is **not recommended** with `--pseudo_aligner salmon`, as any dynamically generated Salmon index will lack decoys. To ensure optimal indexing with decoys, it is **highly recommended** to include the genomic FASTA (`--fasta`) with Salmon, unless a pre-existing decoy-aware Salmon index is supplied. For more details on the benefits of decoy-aware indexing, refer to the [Salmon documentation](https://salmon.readthedocs.io/en/latest/salmon.html#preparing-transcriptome-indices-mapping-based-mode).

 #### Reference genome
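
Reviewer note: the `star_rsem` bullet in this hunk says RSEM now quantifies from pre-aligned BAMs in `--alignments` mode. The module diff is not rendered on this page, so the sketch below is not the module's actual script; it only assembles the general command form documented for `rsem-calculate-expression` (`--alignments [--paired-end] input reference_name sample_name`). The function name, file names and thread count are hypothetical.

```python
import shlex


def rsem_alignments_command(bam, reference_name, sample_name,
                            paired_end=True, threads=4):
    """Build (but do not run) an rsem-calculate-expression call that
    quantifies from an existing transcriptome BAM instead of aligning."""
    cmd = ["rsem-calculate-expression", "--num-threads", str(threads),
           "--alignments"]
    if paired_end:
        cmd.append("--paired-end")
    # Positional arguments: input alignments, RSEM reference prefix,
    # and the prefix used for the output files.
    cmd += [bam, reference_name, sample_name]
    return cmd


# Hypothetical inputs for illustration only.
cmd = rsem_alignments_command("SAMPLE1.transcript.bam", "rsem/genome", "SAMPLE1")
print(shlex.join(cmd))
```

RSEM expects the input alignments to be against the transcript sequences, which is why the transcriptome BAM rather than the genome BAM is passed here.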

modules.json

Lines changed: 2 additions & 2 deletions
@@ -129,7 +129,7 @@
 },
 "rsem/calculateexpression": {
 "branch": "master",
-"git_sha": "20b042e352fc47ab6dab717a622253e96429e887",
+"git_sha": "82cd92d50025a01e1370758ae18fcfe708b6d28c",
 "installed_by": ["modules"]
 },
 "rsem/preparereference": {
@@ -219,7 +219,7 @@
 },
 "sentieon/rsemcalculateexpression": {
 "branch": "master",
-"git_sha": "2779d18605e9923332155d671f45ed37fa185ff4",
+"git_sha": "d9cd3c825e2d05f9c851130100018ae02a766510",
 "installed_by": ["modules"]
 },
 "sentieon/rsempreparereference": {

modules/nf-core/rsem/calculateexpression/main.nf

Lines changed: 31 additions & 5 deletions
Some generated files are not rendered by default.

modules/nf-core/rsem/calculateexpression/meta.yml

Lines changed: 38 additions & 25 deletions
Some generated files are not rendered by default.

modules/nf-core/rsem/calculateexpression/tests/alignment.config

Lines changed: 5 additions & 0 deletions
Some generated files are not rendered by default.
