
Commit 930e294

Authored and committed by Pablo Riesgo Ferreiro
Merge branch 'develop' into 'master'

Release v1.1.0

See merge request tron/tron-bam-preprocessing!6

2 parents (7b22cb6 + 20062b5), commit 930e294

File tree

7 files changed: +881 / -69 lines changed


Makefile

Lines changed: 7 additions & 4 deletions
@@ -8,7 +8,10 @@ clean:
 rm -rf .nextflow*
 
 test:
-nextflow main.nf -profile test,conda --output output/test1
-nextflow main.nf -profile test,conda --skip_bqsr --output output/test2
-nextflow main.nf -profile test,conda --skip_realignment --output output/test3
-nextflow main.nf -profile test,conda --skip_deduplication --output output/test4
+#nextflow main.nf -profile test,conda --output output/test1
+#nextflow main.nf -profile test,conda --skip_bqsr --output output/test2
+#nextflow main.nf -profile test,conda --skip_realignment --output output/test3
+#nextflow main.nf -profile test,conda --skip_deduplication --output output/test4
+#nextflow main.nf -profile test,conda --output output/test5 --skip_metrics
+#nextflow main.nf -profile test,conda --output output/test6 --intervals false
+nextflow main.nf -profile test,conda --output output/test6 --hs_metrics_target_coverage target_coverage.txt --hs_metrics_per_base_coverage per_base_coverage.txt
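For reference, the one test invocation left active in the Makefile can also be run on its own; the coverage file names are simply the ones used in the target above.

```bash
# Smoke test exercising the new HS metrics outputs (same command as the Makefile's test target)
nextflow main.nf -profile test,conda \
    --output output/test6 \
    --hs_metrics_target_coverage target_coverage.txt \
    --hs_metrics_per_base_coverage per_base_coverage.txt
```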

README.md

Lines changed: 41 additions & 23 deletions
@@ -23,23 +23,32 @@ Steps:
 * **Clean BAM**. Sets the mapping quality to 0 for all unmapped reads and avoids soft clipping going beyond the reference genome boundaries. Implemented in Picard
 * **Reorder chromosomes**. Makes the chromosomes in the BAM follow the same order as the reference genome. Implemented in Picard
 * **Add read groups**. GATK requires that some headers are adde to the BAM, also we want to flag somehow the normal and tumor BAMs in the header as some callers, such as Mutect2 require it. Implemented in Picard.
-* **Mark duplicates** (optional). Identify the PCR and the optical duplications and marks those reads. This uses the parallelized version on Spark, it is reported to scale linearly up to 16 CPUs.
-* **Realignment around indels** (optional). This procedure is important for locus based variant callers, but for any variant caller doing haplotype assembly it is not needed. This is computing intensive as it first finds regions for realignment where there are indication of indels and then it performs a local realignment over those regions. Implemented in GATK3, deprecated in GATK4
-* **Base Quality Score Recalibration (BQSR)** (optional). It aims at correcting systematic errors in the sequencer when assigning the base call quality errors, as these scores are used by variant callers it improves variant calling in some situations. Implemented in GATK4
+* **Mark duplicates** (optional). Identify the PCR and the optical duplications and marks those reads. This uses the parallelized version on Spark, it is reported to scale linearly up to 16 CPUs.
+* **Realignment around indels** (optional). This procedure is important for locus based variant callers, but for any variant caller doing haplotype assembly it is not needed. This is computing intensive as it first finds regions for realignment where there are indication of indels and then it performs a local realignment over those regions. Implemented in GATK3, deprecated in GATK4
+* **Base Quality Score Recalibration (BQSR)** (optional). It aims at correcting systematic errors in the sequencer when assigning the base call quality errors, as these scores are used by variant callers it improves variant calling in some situations. Implemented in GATK4
+* **Metrics** (optional). A number of metrics are obtained over the BAM file with Picard's CollectMetrics (eg: duplication, insert size, alignment, etc.).
 
 ![Pipeline](bam_preprocessing2.png)
 
+## References
+
+The bam preprocessing workflow use some required references (`--reference`, `--dbsnp`, `--known_indels1` and `--known_indels2`).
+These resources can be fetched from the GATK bundle https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle.
+
+Optionally, in order to run Picard's CollectHsMetrics an intervals file will need to be provided (`--intervals`).
+This can be built from a BED file using Picard's BedToIntervalList (https://gatk.broadinstitute.org/hc/en-us/articles/360036883931-BedToIntervalList-Picard-)
+
 ## How to run it
 
 ```
-$ nextflow run tron-bioinformatics/tronflow-bam-preprocessing -r v1.0.0 --help
+$ nextflow run tron-bioinformatics/tronflow-bam-preprocessing -r v1.1.0 --help
 N E X T F L O W ~ version 19.07.0
 Launching `main.nf` [intergalactic_shannon] - revision: e707c77d7b
 Usage:
 main.nf --input_files input_files
 
 Input:
-* input_files: the path to a tab-separated values file containing in each row the sample name, sample type (eg: tumor or normal) and path to the BAM file
+* --input_files: the path to a tab-separated values file containing in each row the sample name, sample type (eg: tumor or normal) and path to the BAM file
 Sample type will be added to the BAM header @SN sample name
 The input file does not have header!
 Example input file:
@@ -48,23 +57,32 @@ Input:
 name2 tumor tumor.2.bam
 
 Optional input:
-* reference: path to the FASTA genome reference (indexes expected *.fai, *.dict)
-* dbsnp: path to the dbSNP VCF
-* known_indels1: path to a VCF of known indels
-* known_indels2: path to a second VCF of known indels
-* NOTE: if any of the above parameters is not provided, default hg19 resources will be used
-* skip_bqsr: optionally skip BQSR
-* skip_realignment: optionally skip realignment
-* skip_deduplication: optionally skip deduplication
-* output: the folder where to publish output, if not provided they will be moved to "output" folder inside the workflow folder* prepare_bam_cpus: default 3
-* platform: the platform to be added to the BAM header. Valid values: [ILLUMINA, SOLID, LS454, HELICOS and PACBIO] (default: ILLUMINA)
-* prepare_bam_memory: default 8g
-* mark_duplicates_cpus: default 16
-* mark_duplicates_memory: default 64g
-* realignment_around_indels_cpus: default 2
-* realignment_around_indels_memory: default 32g
-* bqsr_cpus: default 3
-* bqsr_memory: default 4g
+* --reference: path to the FASTA genome reference (indexes expected *.fai, *.dict)
+* --dbsnp: path to the dbSNP VCF
+* --known_indels1: path to a VCF of known indels
+* --known_indels2: path to a second VCF of known indels
+**NOTE**: if any of the above parameters is not provided, default hg19 resources under
+/projects/data/gatk_bundle/hg19/ will be used
+
+* --intervals: path to an intervals file to collect HS metrics from, this can be built with Picard's BedToIntervalList (default: None)
+* --hs_metrics_target_coverage: name of output file for target HS metrics (default: None)
+* --hs_metrics_per_base_coverage: name of output file for per base HS metrics (default: None)
+* --skip_bqsr: optionally skip BQSR (default: false)
+* --skip_realignment: optionally skip realignment (default: false)
+* --skip_deduplication: optionally skip deduplication (default: false)
+* --skip_metrics: optionally skip metrics (default: false)
+* --output: the folder where to publish output (default: ./output)
+* --platform: the platform to be added to the BAM header. Valid values: [ILLUMINA, SOLID, LS454, HELICOS and PACBIO] (default: ILLUMINA)
+
+Computational resources:
+* --prepare_bam_cpus: (default: 3)
+* --prepare_bam_memory: (default: 8g)
+* --mark_duplicates_cpus: (default: 16)
+* --mark_duplicates_memory: (default: 64g)
+* --realignment_around_indels_cpus: (default: 2)
+* --realignment_around_indels_memory: (default: 32g)
+* --bqsr_cpus: (default: 3)
+* --bqsr_memory: (default: 4g)
 
 Output:
 * Preprocessed and indexed BAMs
@@ -73,5 +91,5 @@ Optional input:
 Optional output:
 * Recalibration report
 * Realignment intervals
-* Duplication metrics
+* Metrics
 ```
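The new References section points to Picard's BedToIntervalList for building the `--intervals` file. A minimal sketch of that conversion with GATK4, assuming a `targets.bed` and the sequence dictionary that sits next to the reference FASTA (both file names are illustrative, not part of the repository):

```bash
# Convert a BED file of target regions into the interval_list format expected by --intervals
# targets.bed and ucsc.hg19.dict are placeholder paths
gatk BedToIntervalList \
    --INPUT targets.bed \
    --SEQUENCE_DICTIONARY /projects/data/gatk_bundle/hg19/ucsc.hg19.dict \
    --OUTPUT targets.interval_list
```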

environment.yml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 # You can use this file to create a conda environment for this pipeline:
 # conda env create -f environment.yml
-name: tronflow-bam-preprocessing-1.0.1
+name: tronflow-bam-preprocessing-1.1.0
 channels:
 - conda-forge
 - bioconda
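As the comment at the top of the file suggests, the updated environment can be created and activated with conda; the environment name below matches the new `name:` field.

```bash
# Create and activate the pipeline's conda environment
conda env create -f environment.yml
conda activate tronflow-bam-preprocessing-1.1.0
```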

main.nf

Lines changed: 121 additions & 40 deletions
@@ -7,9 +7,13 @@ params.reference = "/projects/data/gatk_bundle/hg19/ucsc.hg19.fasta"
 params.dbsnp = "/projects/data/gatk_bundle/hg19/dbsnp_138.hg19.vcf"
 params.known_indels1 = "/projects/data/gatk_bundle/hg19/1000G_phase1.indels.hg19.sites.vcf"
 params.known_indels2 = "/projects/data/gatk_bundle/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.sorted.vcf"
+params.intervals = false
+params.hs_metrics_target_coverage = false
+params.hs_metrics_per_base_coverage = false
 params.skip_bqsr = false
 params.skip_realignment = false
 params.skip_deduplication = false
+params.skip_metrics = false
 params.output = false
 params.platform = "ILLUMINA"
 
@@ -27,45 +31,53 @@ params.bqsr_memory = "4g"
 def helpMessage() {
 log.info"""
 Usage:
-bam_preprocessing.nf --input_files input_files --reference reference.fasta
+main.nf --input_files input_files
 
 Input:
-* input_files: the path to a tab-separated values file containing in each row the sample name, sample type (tumor or normal) and path to the BAM file
+* --input_files: the path to a tab-separated values file containing in each row the sample name, sample type (eg: tumor or normal) and path to the BAM file
 Sample type will be added to the BAM header @SN sample name
 The input file does not have header!
 Example input file:
-name1 tumor tumor.1.bam
-name1 normal normal.1.bam
-name2 tumor tumor.2.bam
+name1 tumor tumor.1.bam
+name1 normal normal.1.bam
+name2 tumor tumor.2.bam
 
 Optional input:
-* reference: path to the FASTA genome reference (indexes expected *.fai, *.dict)
-* dbsnp: path to the dbSNP VCF
-* known_indels1: path to a VCF of known indels
-* known_indels2: path to a second VCF of known indels
-* NOTE: if any of the above parameters is not provided, default hg19 resources will be used
-* skip_bqsr: optionally skip BQSR
-* skip_realignment: optionally skip realignment
-* skip_deduplication: optionally skip deduplication
-* output: the folder where to publish output
-* platform: the platform to be added to the BAM header. Valid values: [ILLUMINA, SOLID, LS454, HELICOS and PACBIO] (default: ILLUMINA)
-* prepare_bam_cpus: default 3
-* prepare_bam_memory: default 8g
-* mark_duplicates_cpus: default 16
-* mark_duplicates_memory: default 64g
-* realignment_around_indels_cpus: default 2
-* realignment_around_indels_memory: default 32g
-* bqsr_cpus: default 3
-* bqsr_memory: default 4g
-
-Output:
-* Preprocessed and indexed BAM
+* --reference: path to the FASTA genome reference (indexes expected *.fai, *.dict)
+* --dbsnp: path to the dbSNP VCF
+* --known_indels1: path to a VCF of known indels
+* --known_indels2: path to a second VCF of known indels
+**NOTE**: if any of the above parameters is not provided, default hg19 resources under
+/projects/data/gatk_bundle/hg19/ will be used
+
+* --intervals: path to an intervals file to collect HS metrics from, this can be built with Picard's BedToIntervalList (default: None)
+* --hs_metrics_target_coverage: name of output file for target HS metrics (default: None)
+* --hs_metrics_per_base_coverage: name of output file for per base HS metrics (default: None)
+* --skip_bqsr: optionally skip BQSR (default: false)
+* --skip_realignment: optionally skip realignment (default: false)
+* --skip_deduplication: optionally skip deduplication (default: false)
+* --skip_metrics: optionally skip metrics (default: false)
+* --output: the folder where to publish output (default: ./output)
+* --platform: the platform to be added to the BAM header. Valid values: [ILLUMINA, SOLID, LS454, HELICOS and PACBIO] (default: ILLUMINA)
+
+Computational resources:
+* --prepare_bam_cpus: (default: 3)
+* --prepare_bam_memory: (default: 8g)
+* --mark_duplicates_cpus: (default: 16)
+* --mark_duplicates_memory: (default: 64g)
+* --realignment_around_indels_cpus: (default: 2)
+* --realignment_around_indels_memory: (default: 32g)
+* --bqsr_cpus: (default: 3)
+* --bqsr_memory: (default: 4g)
+
+Output:
+* Preprocessed and indexed BAMs
 * Tab-separated values file with the absolute paths to the preprocessed BAMs, preprocessed_bams.txt
 
 Optional output:
 * Recalibration report
 * Realignment intervals
-* Duplication metrics
+* Metrics
 """
 }
 
@@ -103,8 +115,10 @@ process prepareBam {
 set name, type, file(bam) from input_files
 
 output:
-set val(name), val("${bam.baseName}"), val(type),
-file("${bam.baseName}.prepared.bam"), file("${bam.baseName}.prepared.bai") into prepared_bams
+set val(name),
+val("${bam.baseName}"),
+val(type), file("${bam.baseName}.prepared.bam"),
+file("${bam.baseName}.prepared.bai") into prepared_bams, prepared_bams_for_metrics, prepared_bams_for_hs_metrics
 
 """
 mkdir tmp
@@ -131,8 +145,6 @@ process prepareBam {
 --RGPL ${params.platform} \
 --SORT_ORDER coordinate \
 --CREATE_INDEX true
-
-rm -rf tmp
 """
 }
 
@@ -145,16 +157,18 @@ if (!params.skip_deduplication) {
 cpus "${params.mark_duplicates_cpus}"
 memory "${params.mark_duplicates_memory}"
 tag "${name}"
-publishDir "${publish_dir}/${name}", mode: "copy", pattern: "*.dedup_metrics.txt"
+publishDir "${publish_dir}/${name}/metrics", mode: "copy", pattern: "*.dedup_metrics"
 
 input:
 set name, bam_name, type, file(bam), file(bai) from prepared_bams
 
 output:
 set val(name), val(bam_name), val(type),
 file("${bam.baseName}.dedup.bam"), file("${bam.baseName}.dedup.bam.bai") into deduplicated_bams
-file("${bam.baseName}.dedup_metrics.txt") into deduplication_metrics
+file("${bam.baseName}.dedup_metrics") optional true into deduplication_metrics
 
+script:
+dedup_metrics = params.skip_metrics ? "--metrics-file ${bam.baseName}.dedup_metrics" : ""
 """
 mkdir tmp
 
@@ -163,16 +177,87 @@ if (!params.skip_deduplication) {
 --input ${bam} \
 --output ${bam.baseName}.dedup.bam \
 --conf 'spark.executor.cores=${task.cpus}' \
---metrics-file ${bam.baseName}.dedup_metrics.txt
-
-rm -rf tmp
+${dedup_metrics}
 """
 }
 }
 else {
 deduplicated_bams = prepared_bams
 }
 
+if (! params.skip_metrics) {
+
+if (params.intervals) {
+
+process hsMetrics {
+cpus 1
+memory "2g"
+tag "${name}"
+publishDir "${publish_dir}/${name}/metrics", mode: "copy"
+
+input:
+set name, bam_name, type, file(bam), file(bai) from prepared_bams_for_hs_metrics
+
+output:
+file("*_metrics") optional true into txt_hs_metrics
+file("*.pdf") optional true into pdf_hs_metrics
+file(params.hs_metrics_target_coverage) optional true into target_hs_metrics
+file(params.hs_metrics_per_base_coverage) optional true into per_base_hs_metrics
+
+script:
+hs_metrics_target_coverage= params.hs_metrics_target_coverage ?
+"--PER_TARGET_COVERAGE ${params.hs_metrics_target_coverage} --REFERENCE_SEQUENCE ${params.reference}" :
+""
+hs_metrics_per_base_coverage= params.hs_metrics_per_base_coverage ?
+"--PER_BASE_COVERAGE ${params.hs_metrics_per_base_coverage}" :
+""
+"""
+mkdir tmp
+
+gatk CollectHsMetrics \
+--java-options '-Xmx2g -Djava.io.tmpdir=tmp' \
+--INPUT ${bam} \
+--OUTPUT ${bam.baseName} \
+--TARGET_INTERVALS ${params.intervals} \
+--BAIT_INTERVALS ${params.intervals} \
+${hs_metrics_target_coverage} ${hs_metrics_per_base_coverage}
+"""
+}
+}
+
+process metrics {
+cpus 1
+memory "2g"
+tag "${name}"
+publishDir "${publish_dir}/${name}/metrics", mode: "copy"
+
+input:
+set name, bam_name, type, file(bam), file(bai) from prepared_bams_for_metrics
+
+output:
+file("*_metrics") optional true into txt_metrics
+file("*.pdf") optional true into pdf_metrics
+
+"""
+mkdir tmp
+
+gatk CollectMultipleMetrics \
+--java-options '-Xmx2g -Djava.io.tmpdir=tmp' \
+--INPUT ${bam} \
+--OUTPUT ${bam.baseName} \
+--REFERENCE_SEQUENCE ${params.reference} \
+--PROGRAM QualityScoreDistribution \
+--PROGRAM MeanQualityByCycle \
+--PROGRAM CollectAlignmentSummaryMetrics \
+--PROGRAM CollectBaseDistributionByCycle \
+--PROGRAM CollectGcBiasMetrics \
+--PROGRAM CollectInsertSizeMetrics \
+--PROGRAM CollectSequencingArtifactMetrics \
+--PROGRAM CollectSequencingArtifactMetrics
+"""
+}
+}
+
 if (!params.skip_realignment) {
 process realignmentAroundindels {
 cpus "${params.realignment_around_indels_cpus}"
@@ -207,8 +292,6 @@ if (!params.skip_realignment) {
 --consensusDeterminationModel USE_SW \
 --LODThresholdForCleaning 0.4 \
 --maxReadsInMemory 600000
-
-rm -rf tmp
 """
 }
 }
@@ -248,8 +331,6 @@ if (!params.skip_bqsr) {
 --output ${bam_name}.preprocessed.bam \
 --reference ${params.reference} \
 --bqsr-recal-file ${bam_name}.recalibration_report.grp
-
-rm -rf tmp
 """
 }
 }
nextflow.config

Lines changed: 4 additions & 1 deletion
@@ -23,6 +23,7 @@ profiles {
 params.bqsr_memory = "3g"
 params.known_indels1 = "$baseDir/test_data/1000G_phase1.indels.hg19.sites.minimal.vcf"
 params.known_indels2 = "$baseDir/test_data/Mills_and_1000G_gold_standard.indels.hg19.sites.sorted.minimal.vcf"
+params.intervals = "$baseDir/test_data/minimal_intervals.intervals"
 params.dbsnp = "$baseDir/test_data/dbsnp_138.hg19.minimal.vcf"
 }
 }
@@ -35,6 +36,8 @@ env {
 // Capture exit codes from upstream processes when piping
 process.shell = ['/bin/bash', '-euo', 'pipefail']
 
+cleanup = true
+
 timeline {
 enabled = true
 //file = "${params.output}/execution_timeline.html"
@@ -59,5 +62,5 @@ manifest {
 description = 'Picard and GATK BAM preprocessing pipeline'
 mainScript = 'main.nf'
 nextflowVersion = '>=19.10.0'
-version = '1.0.1'
+version = '1.1.0'
 }
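The test profile now sets a default intervals file; as the Makefile above shows, it can still be disabled from the command line for runs that should skip HS metrics collection.

```bash
# Override the test profile's default intervals file (skips HS metrics collection)
nextflow main.nf -profile test,conda --output output/test6 --intervals false
```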
