
Commit 930e294

Authored and committed by Pablo Riesgo Ferreiro
Merge branch 'develop' into 'master'

Release v1.1.0

See merge request tron/tron-bam-preprocessing!6

2 parents (7b22cb6 + 20062b5), commit 930e294

File tree

7 files changed: +881 / -69 lines changed


Makefile

Lines changed: 7 additions & 4 deletions
@@ -8,7 +8,10 @@ clean:
 rm -rf .nextflow*
 
 test:
-nextflow main.nf -profile test,conda --output output/test1
-nextflow main.nf -profile test,conda --skip_bqsr --output output/test2
-nextflow main.nf -profile test,conda --skip_realignment --output output/test3
-nextflow main.nf -profile test,conda --skip_deduplication --output output/test4
+#nextflow main.nf -profile test,conda --output output/test1
+#nextflow main.nf -profile test,conda --skip_bqsr --output output/test2
+#nextflow main.nf -profile test,conda --skip_realignment --output output/test3
+#nextflow main.nf -profile test,conda --skip_deduplication --output output/test4
+#nextflow main.nf -profile test,conda --output output/test5 --skip_metrics
+#nextflow main.nf -profile test,conda --output output/test6 --intervals false
+nextflow main.nf -profile test,conda --output output/test6 --hs_metrics_target_coverage target_coverage.txt --hs_metrics_per_base_coverage per_base_coverage.txt
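For reference, the one test invocation left active in the Makefile can also be run on its own; the coverage file names are simply the ones used in the target above.

```bash
# Smoke test exercising the new HS metrics outputs (same command as the Makefile's test target)
nextflow main.nf -profile test,conda \
    --output output/test6 \
    --hs_metrics_target_coverage target_coverage.txt \
    --hs_metrics_per_base_coverage per_base_coverage.txt
```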

README.md

Lines changed: 41 additions & 23 deletions
@@ -23,23 +23,32 @@ Steps:
 * **Clean BAM**. Sets the mapping quality to 0 for all unmapped reads and avoids soft clipping going beyond the reference genome boundaries. Implemented in Picard
 * **Reorder chromosomes**. Makes the chromosomes in the BAM follow the same order as the reference genome. Implemented in Picard
 * **Add read groups**. GATK requires that some headers are adde to the BAM, also we want to flag somehow the normal and tumor BAMs in the header as some callers, such as Mutect2 require it. Implemented in Picard.
-* **Mark duplicates** (optional). Identify the PCR and the optical duplications and marks those reads. This uses the parallelized version on Spark, it is reported to scale linearly up to 16 CPUs.
-* **Realignment around indels** (optional). This procedure is important for locus based variant callers, but for any variant caller doing haplotype assembly it is not needed. This is computing intensive as it first finds regions for realignment where there are indication of indels and then it performs a local realignment over those regions. Implemented in GATK3, deprecated in GATK4
-* **Base Quality Score Recalibration (BQSR)** (optional). It aims at correcting systematic errors in the sequencer when assigning the base call quality errors, as these scores are used by variant callers it improves variant calling in some situations. Implemented in GATK4
+* **Mark duplicates** (optional). Identify the PCR and the optical duplications and marks those reads. This uses the parallelized version on Spark, it is reported to scale linearly up to 16 CPUs.
+* **Realignment around indels** (optional). This procedure is important for locus based variant callers, but for any variant caller doing haplotype assembly it is not needed. This is computing intensive as it first finds regions for realignment where there are indication of indels and then it performs a local realignment over those regions. Implemented in GATK3, deprecated in GATK4
+* **Base Quality Score Recalibration (BQSR)** (optional). It aims at correcting systematic errors in the sequencer when assigning the base call quality errors, as these scores are used by variant callers it improves variant calling in some situations. Implemented in GATK4
+* **Metrics** (optional). A number of metrics are obtained over the BAM file with Picard's CollectMetrics (eg: duplication, insert size, alignment, etc.).
 
 ![Pipeline](bam_preprocessing2.png)
 
+## References
+
+The bam preprocessing workflow use some required references (`--reference`, `--dbsnp`, `--known_indels1` and `--known_indels2`).
+These resources can be fetched from the GATK bundle https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle.
+
+Optionally, in order to run Picard's CollectHsMetrics an intervals file will need to be provided (`--intervals`).
+This can be built from a BED file using Picard's BedToIntervalList (https://gatk.broadinstitute.org/hc/en-us/articles/360036883931-BedToIntervalList-Picard-)
+
 ## How to run it
 
 ```
-$ nextflow run tron-bioinformatics/tronflow-bam-preprocessing -r v1.0.0 --help
+$ nextflow run tron-bioinformatics/tronflow-bam-preprocessing -r v1.1.0 --help
 N E X T F L O W ~ version 19.07.0
 Launching `main.nf` [intergalactic_shannon] - revision: e707c77d7b
 Usage:
 main.nf --input_files input_files
 
 Input:
-* input_files: the path to a tab-separated values file containing in each row the sample name, sample type (eg: tumor or normal) and path to the BAM file
+* --input_files: the path to a tab-separated values file containing in each row the sample name, sample type (eg: tumor or normal) and path to the BAM file
 Sample type will be added to the BAM header @SN sample name
 The input file does not have header!
 Example input file:
@@ -48,23 +57,32 @@ Input:
 name2 tumor tumor.2.bam
 
 Optional input:
-* reference: path to the FASTA genome reference (indexes expected *.fai, *.dict)
-* dbsnp: path to the dbSNP VCF
-* known_indels1: path to a VCF of known indels
-* known_indels2: path to a second VCF of known indels
-* NOTE: if any of the above parameters is not provided, default hg19 resources will be used
-* skip_bqsr: optionally skip BQSR
-* skip_realignment: optionally skip realignment
-* skip_deduplication: optionally skip deduplication
-* output: the folder where to publish output, if not provided they will be moved to "output" folder inside the workflow folder* prepare_bam_cpus: default 3
-* platform: the platform to be added to the BAM header. Valid values: [ILLUMINA, SOLID, LS454, HELICOS and PACBIO] (default: ILLUMINA)
-* prepare_bam_memory: default 8g
-* mark_duplicates_cpus: default 16
-* mark_duplicates_memory: default 64g
-* realignment_around_indels_cpus: default 2
-* realignment_around_indels_memory: default 32g
-* bqsr_cpus: default 3
-* bqsr_memory: default 4g
+* --reference: path to the FASTA genome reference (indexes expected *.fai, *.dict)
+* --dbsnp: path to the dbSNP VCF
+* --known_indels1: path to a VCF of known indels
+* --known_indels2: path to a second VCF of known indels
+**NOTE**: if any of the above parameters is not provided, default hg19 resources under
+/projects/data/gatk_bundle/hg19/ will be used
+
+* --intervals: path to an intervals file to collect HS metrics from, this can be built with Picard's BedToIntervalList (default: None)
+* --hs_metrics_target_coverage: name of output file for target HS metrics (default: None)
+* --hs_metrics_per_base_coverage: name of output file for per base HS metrics (default: None)
+* --skip_bqsr: optionally skip BQSR (default: false)
+* --skip_realignment: optionally skip realignment (default: false)
+* --skip_deduplication: optionally skip deduplication (default: false)
+* --skip_metrics: optionally skip metrics (default: false)
+* --output: the folder where to publish output (default: ./output)
+* --platform: the platform to be added to the BAM header. Valid values: [ILLUMINA, SOLID, LS454, HELICOS and PACBIO] (default: ILLUMINA)
+
+Computational resources:
+* --prepare_bam_cpus: (default: 3)
+* --prepare_bam_memory: (default: 8g)
+* --mark_duplicates_cpus: (default: 16)
+* --mark_duplicates_memory: (default: 64g)
+* --realignment_around_indels_cpus: (default: 2)
+* --realignment_around_indels_memory: (default: 32g)
+* --bqsr_cpus: (default: 3)
+* --bqsr_memory: (default: 4g)
 
 Output:
 * Preprocessed and indexed BAMs
@@ -73,5 +91,5 @@ Optional input:
 Optional output:
 * Recalibration report
 * Realignment intervals
-* Duplication metrics
+* Metrics
 ```
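The new References section points to Picard's BedToIntervalList for building the `--intervals` file. A minimal sketch of that conversion with GATK4, assuming a `targets.bed` and the sequence dictionary that sits next to the reference FASTA (both file names are illustrative, not part of the repository):

```bash
# Convert a BED file of target regions into the interval_list format expected by --intervals
# targets.bed and ucsc.hg19.dict are placeholder paths
gatk BedToIntervalList \
    --INPUT targets.bed \
    --SEQUENCE_DICTIONARY /projects/data/gatk_bundle/hg19/ucsc.hg19.dict \
    --OUTPUT targets.interval_list
```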

environment.yml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 # You can use this file to create a conda environment for this pipeline:
 # conda env create -f environment.yml
-name: tronflow-bam-preprocessing-1.0.1
+name: tronflow-bam-preprocessing-1.1.0
 channels:
 - conda-forge
 - bioconda
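As the comment at the top of the file suggests, the updated environment can be created and activated with conda; the environment name below matches the new `name:` field.

```bash
# Create and activate the pipeline's conda environment
conda env create -f environment.yml
conda activate tronflow-bam-preprocessing-1.1.0
```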

main.nf

Lines changed: 121 additions & 40 deletions
@@ -7,9 +7,13 @@ params.reference = "/projects/data/gatk_bundle/hg19/ucsc.hg19.fasta"
 params.dbsnp = "/projects/data/gatk_bundle/hg19/dbsnp_138.hg19.vcf"
 params.known_indels1 = "/projects/data/gatk_bundle/hg19/1000G_phase1.indels.hg19.sites.vcf"
 params.known_indels2 = "/projects/data/gatk_bundle/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.sorted.vcf"
+params.intervals = false
+params.hs_metrics_target_coverage = false
+params.hs_metrics_per_base_coverage = false
 params.skip_bqsr = false
 params.skip_realignment = false
 params.skip_deduplication = false
+params.skip_metrics = false
 params.output = false
 params.platform = "ILLUMINA"
 
@@ -27,45 +31,53 @@ params.bqsr_memory = "4g"
 def helpMessage() {
 log.info"""
 Usage:
-bam_preprocessing.nf --input_files input_files --reference reference.fasta
+main.nf --input_files input_files
 
 Input:
-* input_files: the path to a tab-separated values file containing in each row the sample name, sample type (tumor or normal) and path to the BAM file
+* --input_files: the path to a tab-separated values file containing in each row the sample name, sample type (eg: tumor or normal) and path to the BAM file
 Sample type will be added to the BAM header @SN sample name
 The input file does not have header!
 Example input file:
-name1 tumor tumor.1.bam
-name1 normal normal.1.bam
-name2 tumor tumor.2.bam
+name1 tumor tumor.1.bam
+name1 normal normal.1.bam
+name2 tumor tumor.2.bam
 
 Optional input:
-* reference: path to the FASTA genome reference (indexes expected *.fai, *.dict)
-* dbsnp: path to the dbSNP VCF
-* known_indels1: path to a VCF of known indels
-* known_indels2: path to a second VCF of known indels
-* NOTE: if any of the above parameters is not provided, default hg19 resources will be used
-* skip_bqsr: optionally skip BQSR
-* skip_realignment: optionally skip realignment
-* skip_deduplication: optionally skip deduplication
-* output: the folder where to publish output
-* platform: the platform to be added to the BAM header. Valid values: [ILLUMINA, SOLID, LS454, HELICOS and PACBIO] (default: ILLUMINA)
-* prepare_bam_cpus: default 3
-* prepare_bam_memory: default 8g
-* mark_duplicates_cpus: default 16
-* mark_duplicates_memory: default 64g
-* realignment_around_indels_cpus: default 2
-* realignment_around_indels_memory: default 32g
-* bqsr_cpus: default 3
-* bqsr_memory: default 4g
-
-Output:
-* Preprocessed and indexed BAM
+* --reference: path to the FASTA genome reference (indexes expected *.fai, *.dict)
+* --dbsnp: path to the dbSNP VCF
+* --known_indels1: path to a VCF of known indels
+* --known_indels2: path to a second VCF of known indels
+**NOTE**: if any of the above parameters is not provided, default hg19 resources under
+/projects/data/gatk_bundle/hg19/ will be used
+
+* --intervals: path to an intervals file to collect HS metrics from, this can be built with Picard's BedToIntervalList (default: None)
+* --hs_metrics_target_coverage: name of output file for target HS metrics (default: None)
+* --hs_metrics_per_base_coverage: name of output file for per base HS metrics (default: None)
+* --skip_bqsr: optionally skip BQSR (default: false)
+* --skip_realignment: optionally skip realignment (default: false)
+* --skip_deduplication: optionally skip deduplication (default: false)
+* --skip_metrics: optionally skip metrics (default: false)
+* --output: the folder where to publish output (default: ./output)
+* --platform: the platform to be added to the BAM header. Valid values: [ILLUMINA, SOLID, LS454, HELICOS and PACBIO] (default: ILLUMINA)
+
+Computational resources:
+* --prepare_bam_cpus: (default: 3)
+* --prepare_bam_memory: (default: 8g)
+* --mark_duplicates_cpus: (default: 16)
+* --mark_duplicates_memory: (default: 64g)
+* --realignment_around_indels_cpus: (default: 2)
+* --realignment_around_indels_memory: (default: 32g)
+* --bqsr_cpus: (default: 3)
+* --bqsr_memory: (default: 4g)
+
+Output:
+* Preprocessed and indexed BAMs
 * Tab-separated values file with the absolute paths to the preprocessed BAMs, preprocessed_bams.txt
 
 Optional output:
 * Recalibration report
 * Realignment intervals
-* Duplication metrics
+* Metrics
 """
 }
 
@@ -103,8 +115,10 @@ process prepareBam {
 set name, type, file(bam) from input_files
 
 output:
-set val(name), val("${bam.baseName}"), val(type),
-file("${bam.baseName}.prepared.bam"), file("${bam.baseName}.prepared.bai") into prepared_bams
+set val(name),
+val("${bam.baseName}"),
+val(type), file("${bam.baseName}.prepared.bam"),
+file("${bam.baseName}.prepared.bai") into prepared_bams, prepared_bams_for_metrics, prepared_bams_for_hs_metrics
 
 """
 mkdir tmp
@@ -131,8 +145,6 @@ process prepareBam {
 --RGPL ${params.platform} \
 --SORT_ORDER coordinate \
 --CREATE_INDEX true
-
-rm -rf tmp
 """
 }
 
@@ -145,16 +157,18 @@ if (!params.skip_deduplication) {
 cpus "${params.mark_duplicates_cpus}"
 memory "${params.mark_duplicates_memory}"
 tag "${name}"
-publishDir "${publish_dir}/${name}", mode: "copy", pattern: "*.dedup_metrics.txt"
+publishDir "${publish_dir}/${name}/metrics", mode: "copy", pattern: "*.dedup_metrics"
 
 input:
 set name, bam_name, type, file(bam), file(bai) from prepared_bams
 
 output:
 set val(name), val(bam_name), val(type),
 file("${bam.baseName}.dedup.bam"), file("${bam.baseName}.dedup.bam.bai") into deduplicated_bams
-file("${bam.baseName}.dedup_metrics.txt") into deduplication_metrics
+file("${bam.baseName}.dedup_metrics") optional true into deduplication_metrics
 
+script:
+dedup_metrics = params.skip_metrics ? "--metrics-file ${bam.baseName}.dedup_metrics" : ""
 """
 mkdir tmp
 
@@ -163,16 +177,87 @@ if (!params.skip_deduplication) {
 --input ${bam} \
 --output ${bam.baseName}.dedup.bam \
 --conf 'spark.executor.cores=${task.cpus}' \
---metrics-file ${bam.baseName}.dedup_metrics.txt
-
-rm -rf tmp
+${dedup_metrics}
 """
 }
 }
 else {
 deduplicated_bams = prepared_bams
 }
 
+if (! params.skip_metrics) {
+
+if (params.intervals) {
+
+process hsMetrics {
+cpus 1
+memory "2g"
+tag "${name}"
+publishDir "${publish_dir}/${name}/metrics", mode: "copy"
+
+input:
+set name, bam_name, type, file(bam), file(bai) from prepared_bams_for_hs_metrics
+
+output:
+file("*_metrics") optional true into txt_hs_metrics
+file("*.pdf") optional true into pdf_hs_metrics
+file(params.hs_metrics_target_coverage) optional true into target_hs_metrics
+file(params.hs_metrics_per_base_coverage) optional true into per_base_hs_metrics
+
+script:
+hs_metrics_target_coverage= params.hs_metrics_target_coverage ?
+"--PER_TARGET_COVERAGE ${params.hs_metrics_target_coverage} --REFERENCE_SEQUENCE ${params.reference}" :
+""
+hs_metrics_per_base_coverage= params.hs_metrics_per_base_coverage ?
+"--PER_BASE_COVERAGE ${params.hs_metrics_per_base_coverage}" :
+""
+"""
+mkdir tmp
+
+gatk CollectHsMetrics \
+--java-options '-Xmx2g -Djava.io.tmpdir=tmp' \
+--INPUT ${bam} \
+--OUTPUT ${bam.baseName} \
+--TARGET_INTERVALS ${params.intervals} \
+--BAIT_INTERVALS ${params.intervals} \
+${hs_metrics_target_coverage} ${hs_metrics_per_base_coverage}
+"""
+}
+}
+
+process metrics {
+cpus 1
+memory "2g"
+tag "${name}"
+publishDir "${publish_dir}/${name}/metrics", mode: "copy"
+
+input:
+set name, bam_name, type, file(bam), file(bai) from prepared_bams_for_metrics
+
+output:
+file("*_metrics") optional true into txt_metrics
+file("*.pdf") optional true into pdf_metrics
+
+"""
+mkdir tmp
+
+gatk CollectMultipleMetrics \
+--java-options '-Xmx2g -Djava.io.tmpdir=tmp' \
+--INPUT ${bam} \
+--OUTPUT ${bam.baseName} \
+--REFERENCE_SEQUENCE ${params.reference} \
+--PROGRAM QualityScoreDistribution \
+--PROGRAM MeanQualityByCycle \
+--PROGRAM CollectAlignmentSummaryMetrics \
+--PROGRAM CollectBaseDistributionByCycle \
+--PROGRAM CollectGcBiasMetrics \
+--PROGRAM CollectInsertSizeMetrics \
+--PROGRAM CollectSequencingArtifactMetrics \
+--PROGRAM CollectSequencingArtifactMetrics
+"""
+}
+}
+
 if (!params.skip_realignment) {
 process realignmentAroundindels {
 cpus "${params.realignment_around_indels_cpus}"
@@ -207,8 +292,6 @@ if (!params.skip_realignment) {
 --consensusDeterminationModel USE_SW \
 --LODThresholdForCleaning 0.4 \
 --maxReadsInMemory 600000
-
-rm -rf tmp
 """
 }
 }
@@ -248,8 +331,6 @@ if (!params.skip_bqsr) {
 --output ${bam_name}.preprocessed.bam \
 --reference ${params.reference} \
 --bqsr-recal-file ${bam_name}.recalibration_report.grp
-
-rm -rf tmp
 """
 }
 }
nextflow.config

Lines changed: 4 additions & 1 deletion
@@ -23,6 +23,7 @@ profiles {
 params.bqsr_memory = "3g"
 params.known_indels1 = "$baseDir/test_data/1000G_phase1.indels.hg19.sites.minimal.vcf"
 params.known_indels2 = "$baseDir/test_data/Mills_and_1000G_gold_standard.indels.hg19.sites.sorted.minimal.vcf"
+params.intervals = "$baseDir/test_data/minimal_intervals.intervals"
 params.dbsnp = "$baseDir/test_data/dbsnp_138.hg19.minimal.vcf"
 }
 }
@@ -35,6 +36,8 @@ env {
 // Capture exit codes from upstream processes when piping
 process.shell = ['/bin/bash', '-euo', 'pipefail']
 
+cleanup = true
+
 timeline {
 enabled = true
 //file = "${params.output}/execution_timeline.html"
@@ -59,5 +62,5 @@ manifest {
 description = 'Picard and GATK BAM preprocessing pipeline'
 mainScript = 'main.nf'
 nextflowVersion = '>=19.10.0'
-version = '1.0.1'
+version = '1.1.0'
 }
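The test profile now sets a default intervals file; as the Makefile above shows, it can still be disabled from the command line for runs that should skip HS metrics collection.

```bash
# Override the test profile's default intervals file (skips HS metrics collection)
nextflow main.nf -profile test,conda --output output/test6 --intervals false
```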
