
Commit 2e6bdae

Author: Pablo Riesgo Ferreiro
Merge branch 'develop' into 'master'
Release v1.2.0
See merge request tron/tron-bam-preprocessing!8
2 parents: 930e294 + 53e9d78

File tree: 8 files changed, +117 −86 lines changed


Makefile

Lines changed: 8 additions & 7 deletions

@@ -8,10 +8,11 @@ clean:
 	rm -rf .nextflow*

 test:
-	#nextflow main.nf -profile test,conda --output output/test1
-	#nextflow main.nf -profile test,conda --skip_bqsr --output output/test2
-	#nextflow main.nf -profile test,conda --skip_realignment --output output/test3
-	#nextflow main.nf -profile test,conda --skip_deduplication --output output/test4
-	#nextflow main.nf -profile test,conda --output output/test5 --skip_metrics
-	#nextflow main.nf -profile test,conda --output output/test6 --intervals false
-	nextflow main.nf -profile test,conda --output output/test6 --hs_metrics_target_coverage target_coverage.txt --hs_metrics_per_base_coverage per_base_coverage.txt
+	nextflow main.nf -profile test,conda --output output/test1
+	nextflow main.nf -profile test,conda --skip_bqsr --output output/test2
+	nextflow main.nf -profile test,conda --skip_realignment --output output/test3
+	nextflow main.nf -profile test,conda --skip_deduplication --output output/test4
+	nextflow main.nf -profile test,conda --output output/test5 --skip_metrics
+	nextflow main.nf -profile test,conda --output output/test6 --intervals false
+	nextflow main.nf -profile test,conda --output output/test7 --hs_metrics_target_coverage target_coverage.txt --hs_metrics_per_base_coverage per_base_coverage.txt
+	nextflow main.nf -profile test,conda --output output/test8 --hs_metrics_target_coverage target_coverage.txt --hs_metrics_per_base_coverage per_base_coverage.txt --collect_hs_metrics_min_base_quality 10 --collect_hs_metrics_min_mapping_quality 10

README.md

Lines changed: 18 additions & 14 deletions

@@ -1,5 +1,7 @@
 # TRONflow BAM preprocessing pipeline

+[![DOI](https://zenodo.org/badge/358400957.svg)](https://zenodo.org/badge/latestdoi/358400957)
+
 Nextflow pipeline for the preprocessing of BAM files based on Picard and GATK.


@@ -32,7 +34,9 @@ Steps:

 ## References

-The bam preprocessing workflow use some required references (`--reference`, `--dbsnp`, `--known_indels1` and `--known_indels2`).
+The bam preprocessing workflow requires the human reference genome (`--reference`)
+Base Quality Score Recalibration (BQSR) requires dbSNP to avoid extracting error metrics from polymorphic sites (`--dbsnp`)
+Realignment around indels requires a set of known indels (`--known_indels1` and `--known_indels2`).
 These resources can be fetched from the GATK bundle https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle.

 Optionally, in order to run Picard's CollectHsMetrics an intervals file will need to be provided (`--intervals`).

@@ -41,12 +45,13 @@ This can be built from a BED file using Picard's BedToIntervalList (https://gatk
 ## How to run it

 ```
-$ nextflow run tron-bioinformatics/tronflow-bam-preprocessing -r v1.1.0 --help
+$ nextflow run tron-bioinformatics/tronflow-bam-preprocessing -r v1.2.0 --help
 N E X T F L O W ~ version 19.07.0
 Launching `main.nf` [intergalactic_shannon] - revision: e707c77d7b
+
 Usage:
     main.nf --input_files input_files
-
+
 Input:
     * --input_files: the path to a tab-separated values file containing in each row the sample name, sample type (eg: tumor or normal) and path to the BAM file
     Sample type will be added to the BAM header @SN sample name

@@ -55,25 +60,24 @@ Input:
     name1 tumor tumor.1.bam
     name1 normal normal.1.bam
     name2 tumor tumor.2.bam
-
-Optional input:
     * --reference: path to the FASTA genome reference (indexes expected *.fai, *.dict)
-    * --dbsnp: path to the dbSNP VCF
-    * --known_indels1: path to a VCF of known indels
-    * --known_indels2: path to a second VCF of known indels
-    **NOTE**: if any of the above parameters is not provided, default hg19 resources under
-    /projects/data/gatk_bundle/hg19/ will be used
-
+
+Optional input:
+    * --dbsnp: path to the dbSNP VCF (required to perform BQSR)
+    * --known_indels1: path to a VCF of known indels (optional to perform realignment around indels)
+    * --known_indels2: path to a second VCF of known indels (optional to perform realignment around indels)
     * --intervals: path to an intervals file to collect HS metrics from, this can be built with Picard's BedToIntervalList (default: None)
     * --hs_metrics_target_coverage: name of output file for target HS metrics (default: None)
     * --hs_metrics_per_base_coverage: name of output file for per base HS metrics (default: None)
+    * --collect_hs_minimum_base_quality: minimum base quality for a base to contribute coverage (default: 20).
+    * --collect_hs_minimum_mapping_quality: minimum mapping quality for a read to contribute coverage (default: 20).
     * --skip_bqsr: optionally skip BQSR (default: false)
     * --skip_realignment: optionally skip realignment (default: false)
     * --skip_deduplication: optionally skip deduplication (default: false)
     * --skip_metrics: optionally skip metrics (default: false)
     * --output: the folder where to publish output (default: ./output)
     * --platform: the platform to be added to the BAM header. Valid values: [ILLUMINA, SOLID, LS454, HELICOS and PACBIO] (default: ILLUMINA)
-
+
 Computational resources:
     * --prepare_bam_cpus: (default: 3)
     * --prepare_bam_memory: (default: 8g)

@@ -83,11 +87,11 @@ Computational resources:
     * --realignment_around_indels_memory: (default: 32g)
     * --bqsr_cpus: (default: 3)
     * --bqsr_memory: (default: 4g)
-
+
 Output:
     * Preprocessed and indexed BAMs
     * Tab-separated values file with the absolute paths to the preprocessed BAMs, preprocessed_bams.txt
-
+
 Optional output:
     * Recalibration report
     * Realignment intervals
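The tab-separated, header-less input file described in the help text above can be put together as follows. This is a sketch: the sample names and BAM paths are the placeholders from the README's own example, not files shipped with the pipeline.

```shell
# Build the header-less, tab-separated input file:
# <sample name> <sample type> <path to BAM>
printf 'name1\ttumor\ttumor.1.bam\n'   >  input_files.tsv
printf 'name1\tnormal\tnormal.1.bam\n' >> input_files.tsv
printf 'name2\ttumor\ttumor.2.bam\n'   >> input_files.tsv

# Sanity check: every row must have exactly three tab-separated columns
awk -F'\t' 'NF != 3 { bad = 1 } END { exit bad }' input_files.tsv && echo "input OK"
```

The resulting file would then be passed to the pipeline as `--input_files input_files.tsv`.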

environment.yml

Lines changed: 2 additions & 1 deletion

@@ -1,10 +1,11 @@
 # You can use this file to create a conda environment for this pipeline:
 # conda env create -f environment.yml
-name: tronflow-bam-preprocessing-1.1.0
+name: tronflow-bam-preprocessing-1.2.0
 channels:
   - conda-forge
   - bioconda
   - defaults
 dependencies:
+  - openjdk=8.0.282
   - bioconda::gatk4=4.2.0.0
   - bioconda::gatk=3.8

main.nf

Lines changed: 30 additions & 61 deletions

@@ -3,10 +3,10 @@
 publish_dir = 'output'
 params.help= false
 params.input_files = false
-params.reference = "/projects/data/gatk_bundle/hg19/ucsc.hg19.fasta"
-params.dbsnp = "/projects/data/gatk_bundle/hg19/dbsnp_138.hg19.vcf"
-params.known_indels1 = "/projects/data/gatk_bundle/hg19/1000G_phase1.indels.hg19.sites.vcf"
-params.known_indels2 = "/projects/data/gatk_bundle/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.sorted.vcf"
+params.reference = false
+params.dbsnp = false
+params.known_indels1 = false
+params.known_indels2 = false
 params.intervals = false
 params.hs_metrics_target_coverage = false
 params.hs_metrics_per_base_coverage = false

@@ -16,7 +16,10 @@ params.skip_deduplication = false
 params.skip_metrics = false
 params.output = false
 params.platform = "ILLUMINA"
+params.collect_hs_metrics_min_base_quality = false
+params.collect_hs_metrics_min_mapping_quality = false

+// computational resources
 params.prepare_bam_cpus = 3
 params.prepare_bam_memory = "8g"
 params.mark_duplicates_cpus = 16

@@ -29,63 +32,22 @@ params.bqsr_memory = "4g"


 def helpMessage() {
-    log.info"""
-    Usage:
-    main.nf --input_files input_files
-
-    Input:
-    * --input_files: the path to a tab-separated values file containing in each row the sample name, sample type (eg: tumor or normal) and path to the BAM file
-    Sample type will be added to the BAM header @SN sample name
-    The input file does not have header!
-    Example input file:
-    name1 tumor tumor.1.bam
-    name1 normal normal.1.bam
-    name2 tumor tumor.2.bam
-
-    Optional input:
-    * --reference: path to the FASTA genome reference (indexes expected *.fai, *.dict)
-    * --dbsnp: path to the dbSNP VCF
-    * --known_indels1: path to a VCF of known indels
-    * --known_indels2: path to a second VCF of known indels
-    **NOTE**: if any of the above parameters is not provided, default hg19 resources under
-    /projects/data/gatk_bundle/hg19/ will be used
-
-    * --intervals: path to an intervals file to collect HS metrics from, this can be built with Picard's BedToIntervalList (default: None)
-    * --hs_metrics_target_coverage: name of output file for target HS metrics (default: None)
-    * --hs_metrics_per_base_coverage: name of output file for per base HS metrics (default: None)
-    * --skip_bqsr: optionally skip BQSR (default: false)
-    * --skip_realignment: optionally skip realignment (default: false)
-    * --skip_deduplication: optionally skip deduplication (default: false)
-    * --skip_metrics: optionally skip metrics (default: false)
-    * --output: the folder where to publish output (default: ./output)
-    * --platform: the platform to be added to the BAM header. Valid values: [ILLUMINA, SOLID, LS454, HELICOS and PACBIO] (default: ILLUMINA)
-
-    Computational resources:
-    * --prepare_bam_cpus: (default: 3)
-    * --prepare_bam_memory: (default: 8g)
-    * --mark_duplicates_cpus: (default: 16)
-    * --mark_duplicates_memory: (default: 64g)
-    * --realignment_around_indels_cpus: (default: 2)
-    * --realignment_around_indels_memory: (default: 32g)
-    * --bqsr_cpus: (default: 3)
-    * --bqsr_memory: (default: 4g)
-
-    Output:
-    * Preprocessed and indexed BAMs
-    * Tab-separated values file with the absolute paths to the preprocessed BAMs, preprocessed_bams.txt
-
-    Optional output:
-    * Recalibration report
-    * Realignment intervals
-    * Metrics
-    """
+    log.info params.help_message
 }

 if (params.help) {
     helpMessage()
     exit 0
 }

+if (!params.reference) {
+    exit -1, "--reference is required"
+}
+
+if (!params.skip_bqsr && !params.dbsnp) {
+    exit -1, "--dbsnp is required to perform BQSR"
+}
+
 if (params.output) {
     publish_dir = params.output
 }

@@ -168,7 +130,7 @@ if (!params.skip_deduplication) {
     file("${bam.baseName}.dedup_metrics") optional true into deduplication_metrics

     script:
-    dedup_metrics = params.skip_metrics ? "--metrics-file ${bam.baseName}.dedup_metrics" : ""
+    dedup_metrics = params.skip_metrics ? "" : "--metrics-file ${bam.baseName}.dedup_metrics"
     """
     mkdir tmp
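The one-line fix in this hunk swaps the ternary branches: previously the `--metrics-file` flag was emitted only when `--skip_metrics` was set, which was the opposite of the intent. The corrected decision can be restated as a shell sketch (the variable values and file name below are illustrative):

```shell
# Mirror of the corrected Groovy ternary:
# dedup_metrics = params.skip_metrics ? "" : "--metrics-file ${bam.baseName}.dedup_metrics"
skip_metrics=false
bam_basename="sample1"

if [ "$skip_metrics" = "true" ]; then
    dedup_metrics=""                                              # metrics skipped: no flag
else
    dedup_metrics="--metrics-file ${bam_basename}.dedup_metrics"  # collect dedup metrics
fi

echo "MarkDuplicates extra flags: ${dedup_metrics}"
```

With `skip_metrics=false`, the flag is now present, so the deduplication metrics file is actually produced.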
@@ -211,6 +173,10 @@ if (! params.skip_metrics) {
     hs_metrics_per_base_coverage= params.hs_metrics_per_base_coverage ?
         "--PER_BASE_COVERAGE ${params.hs_metrics_per_base_coverage}" :
         ""
+    minimum_base_quality = params.collect_hs_metrics_min_base_quality ?
+        "--MINIMUM_BASE_QUALITY ${params.collect_hs_metrics_min_base_quality}" : ""
+    minimum_mapping_quality = params.collect_hs_metrics_min_mapping_quality ?
+        "--MINIMUM_MAPPING_QUALITY ${params.collect_hs_metrics_min_mapping_quality}" : ""
     """
     mkdir tmp

@@ -220,7 +186,7 @@ if (! params.skip_metrics) {
     --OUTPUT ${bam.baseName} \
     --TARGET_INTERVALS ${params.intervals} \
     --BAIT_INTERVALS ${params.intervals} \
-    ${hs_metrics_target_coverage} ${hs_metrics_per_base_coverage}
+    ${hs_metrics_target_coverage} ${hs_metrics_per_base_coverage} ${minimum_base_quality} ${minimum_mapping_quality}
     """
 }
 }
@@ -272,26 +238,29 @@ if (!params.skip_realignment) {
     set val(name), val(bam_name), val(type), file("${bam.baseName}.realigned.bam"), file("${bam.baseName}.realigned.bai") into realigned_bams
     file("${bam.baseName}.RA.intervals") into realignment_intervals

+    script:
+    known_indels = "" + params.known_indels1 ? " --known ${params.known_indels1}" : "" +
+        params.known_indels2 ? " --known ${params.known_indels2}" : ""
+    known_alleles = "" + params.known_indels1 ? " --knownAlleles ${params.known_indels1}" : "" +
+        params.known_indels2 ? " --knownAlleles ${params.known_indels2}" : ""
     """
     mkdir tmp

     gatk3 -Xmx${params.realignment_around_indels_memory} -Djava.io.tmpdir=tmp -T RealignerTargetCreator \
     --input_file ${bam} \
     --out ${bam.baseName}.RA.intervals \
     --reference_sequence ${params.reference} \
-    --known ${params.known_indels1} \
-    --known ${params.known_indels2}
+    ${known_indels}

     gatk3 -Xmx${params.realignment_around_indels_memory} -Djava.io.tmpdir=tmp -T IndelRealigner \
     --input_file ${bam} \
     --out ${bam.baseName}.realigned.bam \
     --reference_sequence ${params.reference} \
     --targetIntervals ${bam.baseName}.RA.intervals \
-    --knownAlleles ${params.known_indels1} \
-    --knownAlleles ${params.known_indels2} \
     --consensusDeterminationModel USE_SW \
     --LODThresholdForCleaning 0.4 \
-    --maxReadsInMemory 600000
+    --maxReadsInMemory 600000 \
+    ${known_alleles}
     """
 }
 }
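The new `script:` block above makes both known-indels VCFs optional: a `--known` / `--knownAlleles` argument is appended only for parameters that were actually provided, and an unset parameter contributes nothing to the GATK3 command line. The same accumulate-if-set pattern in shell (the file name and the `false` sentinel mimic the Nextflow params, but are illustrative):

```shell
# Assemble optional GATK3 "--known" arguments from two optional VCF inputs.
# Like the Nextflow params, an unset input holds the sentinel value "false".
known_indels1="1000G_phase1.indels.vcf"   # provided
known_indels2="false"                     # not provided

known_args=""
if [ "$known_indels1" != "false" ]; then
    known_args="$known_args --known $known_indels1"
fi
if [ "$known_indels2" != "false" ]; then
    known_args="$known_args --known $known_indels2"
fi

echo "RealignerTargetCreator extra args:$known_args"
```

When neither VCF is provided, `known_args` stays empty and the realignment commands run without known-indels hints.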

nextflow.config

Lines changed: 57 additions & 1 deletion

@@ -55,12 +55,68 @@ dag {
   //file = "${params.output}/pipeline_dag.svg"
 }

+VERSION = '1.2.0'
+DOI = 'https://zenodo.org/badge/latestdoi/358400957'
+
 manifest {
   name = 'TRON-Bioinformatics/tronflow-bam-preprocessing'
   author = 'Pablo Riesgo Ferreiro'
   homePage = 'https://github.com/TRON-Bioinformatics/tronflow-bam-preprocessing'
   description = 'Picard and GATK BAM preprocessing pipeline'
   mainScript = 'main.nf'
   nextflowVersion = '>=19.10.0'
-  version = '1.1.0'
+  version = VERSION
+  doi = DOI
 }
+
+params.help_message = """
+TronFlow bam preprocessing v${VERSION} ${DOI}
+
+Usage:
+    main.nf --input_files input_files
+
+Input:
+    * --input_files: the path to a tab-separated values file containing in each row the sample name, sample type (eg: tumor or normal) and path to the BAM file
+    Sample type will be added to the BAM header @SN sample name
+    The input file does not have header!
+    Example input file:
+    name1 tumor tumor.1.bam
+    name1 normal normal.1.bam
+    name2 tumor tumor.2.bam
+    * --reference: path to the FASTA genome reference (indexes expected *.fai, *.dict)
+
+Optional input:
+    * --dbsnp: path to the dbSNP VCF (required to perform BQSR)
+    * --known_indels1: path to a VCF of known indels (optional to perform realignment around indels)
+    * --known_indels2: path to a second VCF of known indels (optional to perform realignment around indels)
+    * --intervals: path to an intervals file to collect HS metrics from, this can be built with Picard's BedToIntervalList (default: None)
+    * --hs_metrics_target_coverage: name of output file for target HS metrics (default: None)
+    * --hs_metrics_per_base_coverage: name of output file for per base HS metrics (default: None)
+    * --collect_hs_minimum_base_quality: minimum base quality for a base to contribute coverage (default: 20).
+    * --collect_hs_minimum_mapping_quality: minimum mapping quality for a read to contribute coverage (default: 20).
+    * --skip_bqsr: optionally skip BQSR (default: false)
+    * --skip_realignment: optionally skip realignment (default: false)
+    * --skip_deduplication: optionally skip deduplication (default: false)
+    * --skip_metrics: optionally skip metrics (default: false)
+    * --output: the folder where to publish output (default: ./output)
+    * --platform: the platform to be added to the BAM header. Valid values: [ILLUMINA, SOLID, LS454, HELICOS and PACBIO] (default: ILLUMINA)
+
+Computational resources:
+    * --prepare_bam_cpus: (default: 3)
+    * --prepare_bam_memory: (default: 8g)
+    * --mark_duplicates_cpus: (default: 16)
+    * --mark_duplicates_memory: (default: 64g)
+    * --realignment_around_indels_cpus: (default: 2)
+    * --realignment_around_indels_memory: (default: 32g)
+    * --bqsr_cpus: (default: 3)
+    * --bqsr_memory: (default: 4g)
+
+Output:
+    * Preprocessed and indexed BAMs
+    * Tab-separated values file with the absolute paths to the preprocessed BAMs, preprocessed_bams.txt
+
+Optional output:
+    * Recalibration report
+    * Realignment intervals
+    * Metrics
+"""
2 files renamed without changes.

test_data/test_input.txt

Lines changed: 2 additions & 2 deletions

@@ -1,2 +1,2 @@
-TESTX_H7YRLADXX_S1_L001 tumor test_data/TESTX_H7YRLADXX_S1_L001.bam
-TESTX_H7YRLADXX_S1_L002 normal test_data/TESTX_H7YRLADXX_S1_L002.bam
+TESTX_S1_L001 tumor test_data/TESTX_S1_L001.bam
+TESTX_S1_L002 normal test_data/TESTX_S1_L002.bam
