README.md: 41 additions & 23 deletions
@@ -23,23 +23,32 @@ Steps:
 * **Clean BAM**. Sets the mapping quality to 0 for all unmapped reads and avoids soft clipping going beyond the reference genome boundaries. Implemented in Picard
 * **Reorder chromosomes**. Makes the chromosomes in the BAM follow the same order as the reference genome. Implemented in Picard
 * **Add read groups**. GATK requires that some headers are added to the BAM. We also want to flag the normal and tumor BAMs in the header, as some callers, such as Mutect2, require it. Implemented in Picard.
-* **Mark duplicates** (optional). Identifies PCR and optical duplicates and marks those reads. This uses the parallelized version on Spark, which is reported to scale linearly up to 16 CPUs.
-* **Realignment around indels** (optional). This procedure is important for locus-based variant callers, but it is not needed for any variant caller doing haplotype assembly. It is computationally intensive, as it first finds regions with indications of indels and then performs a local realignment over those regions. Implemented in GATK3, deprecated in GATK4
-* **Base Quality Score Recalibration (BQSR)** (optional). It aims at correcting systematic errors made by the sequencer when assigning base call quality scores. As these scores are used by variant callers, it improves variant calling in some situations. Implemented in GATK4
+* **Mark duplicates** (optional). Identifies PCR and optical duplicates and marks those reads. This uses the parallelized version on Spark, which is reported to scale linearly up to 16 CPUs.
+* **Realignment around indels** (optional). This procedure is important for locus-based variant callers, but it is not needed for any variant caller doing haplotype assembly. It is computationally intensive, as it first finds regions with indications of indels and then performs a local realignment over those regions. Implemented in GATK3, deprecated in GATK4
+* **Base Quality Score Recalibration (BQSR)** (optional). It aims at correcting systematic errors made by the sequencer when assigning base call quality scores. As these scores are used by variant callers, it improves variant calling in some situations. Implemented in GATK4
+* **Metrics** (optional). A number of metrics are obtained over the BAM file with Picard's CollectMetrics (e.g. duplication, insert size, alignment, etc.).

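Review note: the steps listed above correspond to GATK4 command-line tools. The sketch below is a dry run that only prints a possible command chain for one BAM. It is not the workflow's actual implementation, which this diff does not show. The function name `plan_preprocessing`, all file names and the read-group values are hypothetical, and passing the sample type as the read-group sample name is an assumption based on the help text. Reordering, realignment around indels (GATK3) and metrics are left out.

```shell
#!/bin/sh
# Dry-run sketch: prints a possible GATK4 command chain for one BAM.
# Nothing is executed; file names and read-group values are hypothetical.
plan_preprocessing() {
  sample="$1"; sample_type="$2"; bam="$3"; ref="$4"; dbsnp="$5"

  # Clean BAM (Picard CleanSam, bundled with GATK4)
  echo "gatk CleanSam -I ${bam} -O ${sample}.clean.bam"

  # Add read groups; assumption: the sample type (tumor/normal) goes into the sample name
  echo "gatk AddOrReplaceReadGroups -I ${sample}.clean.bam -O ${sample}.rg.bam --RGID ${sample} --RGLB 1 --RGPL ILLUMINA --RGPU 1 --RGSM ${sample_type}"

  # Mark duplicates (optional), Spark implementation
  echo "gatk MarkDuplicatesSpark -I ${sample}.rg.bam -O ${sample}.dedup.bam"

  # BQSR (optional): build the recalibration table, then apply it
  echo "gatk BaseRecalibrator -I ${sample}.dedup.bam -R ${ref} --known-sites ${dbsnp} -O ${sample}.recal.table"
  echo "gatk ApplyBQSR -I ${sample}.dedup.bam -R ${ref} --bqsr-recal-file ${sample}.recal.table -O ${sample}.preprocessed.bam"
}

# Example: print the plan for a hypothetical tumor sample
plan_preprocessing name1 tumor tumor.1.bam hg19.fasta dbsnp.vcf
```

Printing the commands instead of running them keeps the sketch usable without GATK installed; the real resource settings and intermediate file handling live in the Nextflow workflow.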
+## References
+
+The BAM preprocessing workflow uses some required references (`--reference`, `--dbsnp`, `--known_indels1` and `--known_indels2`).
+These resources can be fetched from the GATK bundle https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle.
+
+Optionally, in order to run Picard's CollectHsMetrics an intervals file will need to be provided (`--intervals`).
+This can be built from a BED file using Picard's BedToIntervalList (https://gatk.broadinstitute.org/hc/en-us/articles/360036883931-BedToIntervalList-Picard-)
+
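Review note: since the help text says the reference is expected to come with `*.fai` and `*.dict` indexes, a small pre-flight check before launching the workflow may save a failed run. This is only a sketch: `check_reference` is a hypothetical helper, and it assumes the usual GATK naming, where the `.fai` sits next to the FASTA and the `.dict` replaces the FASTA extension.

```shell
#!/bin/sh
# Pre-flight sketch: check that a reference FASTA has the index files GATK expects.
# Assumption: <ref>.fai next to the FASTA, and a .dict that replaces the
# FASTA extension (e.g. hg19.fasta -> hg19.dict).
check_reference() {
  ref="$1"
  status=0
  for f in "$ref" "${ref}.fai" "${ref%.*}.dict"; do
    if [ ! -f "$f" ]; then
      echo "missing: $f" >&2
      status=1
    fi
  done
  return "$status"
}
```

It could be chained in front of the run, for example `check_reference /path/to/hg19.fasta && nextflow run tron-bioinformatics/tronflow-bam-preprocessing -r v1.1.0 --reference /path/to/hg19.fasta ...`, where the path is a placeholder and the remaining parameters are elided.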
 ## How to run it
 
 ```
-$ nextflow run tron-bioinformatics/tronflow-bam-preprocessing -r v1.0.0 --help
+$ nextflow run tron-bioinformatics/tronflow-bam-preprocessing -r v1.1.0 --help
-* input_files: the path to a tab-separated values file containing in each row the sample name, sample type (eg: tumor or normal) and path to the BAM file
+* --input_files: the path to a tab-separated values file containing in each row the sample name, sample type (eg: tumor or normal) and path to the BAM file
 Sample type will be added to the BAM header @SN sample name
 The input file does not have header!
 Example input file:
@@ -48,23 +57,32 @@ Input:
 name2 tumor tumor.2.bam
 
 Optional input:
-* reference: path to the FASTA genome reference (indexes expected *.fai, *.dict)
-* dbsnp: path to the dbSNP VCF
-* known_indels1: path to a VCF of known indels
-* known_indels2: path to a second VCF of known indels
-* NOTE: if any of the above parameters is not provided, default hg19 resources will be used
-* output: the folder where to publish output, if not provided they will be moved to "output" folder inside the workflow folder
-* prepare_bam_cpus: default 3
-* platform: the platform to be added to the BAM header. Valid values: [ILLUMINA, SOLID, LS454, HELICOS and PACBIO] (default: ILLUMINA)
-* prepare_bam_memory: default 8g
-* mark_duplicates_cpus: default 16
-* mark_duplicates_memory: default 64g
-* realignment_around_indels_cpus: default 2
-* realignment_around_indels_memory: default 32g
-* bqsr_cpus: default 3
-* bqsr_memory: default 4g
+* --reference: path to the FASTA genome reference (indexes expected *.fai, *.dict)
+* --dbsnp: path to the dbSNP VCF
+* --known_indels1: path to a VCF of known indels
+* --known_indels2: path to a second VCF of known indels
+**NOTE**: if any of the above parameters is not provided, default hg19 resources under
+/projects/data/gatk_bundle/hg19/ will be used
+
+* --intervals: path to an intervals file to collect HS metrics from, this can be built with Picard's BedToIntervalList (default: None)
+* --hs_metrics_target_coverage: name of output file for target HS metrics (default: None)
+* --hs_metrics_per_base_coverage: name of output file for per base HS metrics (default: None)
-* input_files: the path to a tab-separated values file containing in each row the sample name, sample type (tumor or normal) and path to the BAM file
+* --input_files: the path to a tab-separated values file containing in each row the sample name, sample type (eg: tumor or normal) and path to the BAM file
 Sample type will be added to the BAM header @SN sample name
 The input file does not have header!
 Example input file:
-name1 tumor tumor.1.bam
-name1 normal normal.1.bam
-name2 tumor tumor.2.bam
+name1 tumor tumor.1.bam
+name1 normal normal.1.bam
+name2 tumor tumor.2.bam
 
 Optional input:
-* reference: path to the FASTA genome reference (indexes expected *.fai, *.dict)
-* dbsnp: path to the dbSNP VCF
-* known_indels1: path to a VCF of known indels
-* known_indels2: path to a second VCF of known indels
-* NOTE: if any of the above parameters is not provided, default hg19 resources will be used