Commit ce10fdb

Merge pull request #1 from TRON-Bioinformatics/migrate-dsl2
Migrate to Nextflow DSL 2
2 parents e89814c + fb2a56b commit ce10fdb

22 files changed: +801 −346 lines

.github/workflows/automated_tests.yml

Lines changed: 32 additions & 0 deletions
```yaml
name: Automated tests

on: [push]

jobs:
  test:
    runs-on: ubuntu-20.04

    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-java@v2
        with:
          distribution: 'zulu' # See 'Supported distributions' for available options
          java-version: '11'
      - uses: conda-incubator/setup-miniconda@v2
        with:
          auto-update-conda: true
          channels: defaults,conda-forge,bioconda
      - name: Install dependencies
        run: |
          apt-get update && apt-get --assume-yes install wget make procps software-properties-common
          wget -qO- https://get.nextflow.io | bash && cp nextflow /usr/local/bin/nextflow
          conda update conda
      - name: Cache conda environments
        uses: actions/cache@v2
        with:
          path: |
            /home/runner/work/tronflow-bam-preprocessing/tronflow-bam-preprocessing/work/conda
          key: ${{ runner.os }}-tronflow-bam-preprocessing
      - name: Run tests
        run: |
          make
```

Makefile

Lines changed: 11 additions & 49 deletions
```diff
@@ -1,56 +1,18 @@
-all : clean test check
+all : clean test


 clean:
	rm -rf output
	rm -f .nextflow.log*
	rm -rf .nextflow*


 test:
-	nextflow main.nf --help
-	nextflow main.nf -profile test,conda --output output/test1
-	nextflow main.nf -profile test,conda --skip_bqsr --output output/test2
-	nextflow main.nf -profile test,conda --skip_realignment --output output/test3
-	nextflow main.nf -profile test,conda --skip_deduplication --output output/test4
-	nextflow main.nf -profile test,conda --output output/test5 --skip_deduplication --skip_bqsr --skip_metrics --known_indels1 false --known_indels2 false
-	nextflow main.nf -profile test,conda --output output/test6 --intervals false --skip_deduplication --skip_bqsr --skip_realignment
-	nextflow main.nf -profile test,conda --output output/test7 --skip_bqsr --skip_realignment
-	nextflow main.nf -profile test,conda --output output/test8 --collect_hs_metrics_min_base_quality 10 --collect_hs_metrics_min_mapping_quality 10 --remove_duplicates false --skip_bqsr --skip_realignment
-	nextflow main.nf -profile test,conda --output output/test9 --skip_deduplication --skip_bqsr --skip_realignment --input_files false --input_bam test_data/TESTX_S1_L001.bam
-
-check:
-	test -s output/test1/sample1/TESTX_S1_L001.preprocessed.bam || { echo "Missing test 1 output file!"; exit 1; }
-	test -s output/test1/sample1/TESTX_S1_L001.preprocessed.bai || { echo "Missing test 1 output file!"; exit 1; }
-	test -s output/test1/sample2/TESTX_S1_L002.preprocessed.bam || { echo "Missing test 1 output file!"; exit 1; }
-	test -s output/test1/sample2/TESTX_S1_L002.preprocessed.bai || { echo "Missing test 1 output file!"; exit 1; }
-	test -s output/test2/sample1/TESTX_S1_L001.preprocessed.bam || { echo "Missing test 2 output file!"; exit 1; }
-	test -s output/test2/sample1/TESTX_S1_L001.preprocessed.bai || { echo "Missing test 2 output file!"; exit 1; }
-	test -s output/test2/sample2/TESTX_S1_L002.preprocessed.bam || { echo "Missing test 2 output file!"; exit 1; }
-	test -s output/test2/sample2/TESTX_S1_L002.preprocessed.bai || { echo "Missing test 2 output file!"; exit 1; }
-	test -s output/test3/sample1/TESTX_S1_L001.preprocessed.bam || { echo "Missing test 3 output file!"; exit 1; }
-	test -s output/test3/sample1/TESTX_S1_L001.preprocessed.bai || { echo "Missing test 3 output file!"; exit 1; }
-	test -s output/test3/sample2/TESTX_S1_L002.preprocessed.bam || { echo "Missing test 3 output file!"; exit 1; }
-	test -s output/test3/sample2/TESTX_S1_L002.preprocessed.bai || { echo "Missing test 3 output file!"; exit 1; }
-	test -s output/test4/sample1/TESTX_S1_L001.preprocessed.bam || { echo "Missing test 4 output file!"; exit 1; }
-	test -s output/test4/sample1/TESTX_S1_L001.preprocessed.bai || { echo "Missing test 4 output file!"; exit 1; }
-	test -s output/test4/sample2/TESTX_S1_L002.preprocessed.bam || { echo "Missing test 4 output file!"; exit 1; }
-	test -s output/test4/sample2/TESTX_S1_L002.preprocessed.bai || { echo "Missing test 4 output file!"; exit 1; }
-	test -s output/test5/sample1/TESTX_S1_L001.preprocessed.bam || { echo "Missing test 5 output file!"; exit 1; }
-	test -s output/test5/sample1/TESTX_S1_L001.preprocessed.bai || { echo "Missing test 5 output file!"; exit 1; }
-	test -s output/test5/sample2/TESTX_S1_L002.preprocessed.bam || { echo "Missing test 5 output file!"; exit 1; }
-	test -s output/test5/sample2/TESTX_S1_L002.preprocessed.bai || { echo "Missing test 5 output file!"; exit 1; }
-	test -s output/test6/sample1/TESTX_S1_L001.preprocessed.bam || { echo "Missing test 6 output file!"; exit 1; }
-	test -s output/test6/sample1/TESTX_S1_L001.preprocessed.bai || { echo "Missing test 6 output file!"; exit 1; }
-	test -s output/test6/sample2/TESTX_S1_L002.preprocessed.bam || { echo "Missing test 6 output file!"; exit 1; }
-	test -s output/test6/sample2/TESTX_S1_L002.preprocessed.bai || { echo "Missing test 6 output file!"; exit 1; }
-	test -s output/test7/sample1/TESTX_S1_L001.preprocessed.bam || { echo "Missing test 7 output file!"; exit 1; }
-	test -s output/test7/sample1/TESTX_S1_L001.preprocessed.bai || { echo "Missing test 7 output file!"; exit 1; }
-	test -s output/test7/sample2/TESTX_S1_L002.preprocessed.bam || { echo "Missing test 7 output file!"; exit 1; }
-	test -s output/test7/sample2/TESTX_S1_L002.preprocessed.bai || { echo "Missing test 7 output file!"; exit 1; }
-	test -s output/test8/sample1/TESTX_S1_L001.preprocessed.bam || { echo "Missing test 8 output file!"; exit 1; }
-	test -s output/test8/sample1/TESTX_S1_L001.preprocessed.bai || { echo "Missing test 8 output file!"; exit 1; }
-	test -s output/test8/sample1/metrics/TESTX_S1_L001.prepared.dedup.hs_metrics.txt || { echo "Missing test 8 output file!"; exit 1; }
-	test -s output/test8/sample1/metrics/TESTX_S1_L001.prepared.dedup_metrics.txt || { echo "Missing test 8 output file!"; exit 1; }
-	test -s output/test8/sample2/TESTX_S1_L002.preprocessed.bam || { echo "Missing test 8 output file!"; exit 1; }
-	test -s output/test8/sample2/TESTX_S1_L002.preprocessed.bai || { echo "Missing test 8 output file!"; exit 1; }
-	test -s output/test9/TESTX_S1_L001/TESTX_S1_L001.preprocessed.bam || { echo "Missing test 9 output file!"; exit 1; }
-	test -s output/test9/TESTX_S1_L001/TESTX_S1_L001.preprocessed.bai || { echo "Missing test 9 output file!"; exit 1; }
+	bash tests/test_00.sh
+	bash tests/test_01.sh
+	bash tests/test_02.sh
+	bash tests/test_03.sh
+	bash tests/test_04.sh
+	bash tests/test_05.sh
+	bash tests/test_06.sh
+	bash tests/test_07.sh
+	bash tests/test_08.sh
+	bash tests/test_09.sh
```
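The contents of the new per-test scripts are not included in this commit view; presumably each one pairs a single pipeline invocation with the output checks that the removed `check` target used to perform. A minimal sketch of what such a script might look like (the `assert_nonempty` helper name is an assumption, not taken from the repository):

```shell
#!/bin/bash
# Hypothetical sketch of one split-out test script (e.g. tests/test_01.sh);
# the real scripts are not shown in this diff.
set -e

assert_nonempty() {
  # Same idiom as the removed `check` target: fail when an expected
  # output file is missing or empty.
  test -s "$1" || { echo "Missing output file: $1"; exit 1; }
}

if command -v nextflow > /dev/null 2>&1; then
  nextflow main.nf -profile test,conda --output output/test1
  assert_nonempty output/test1/sample1/TESTX_S1_L001.preprocessed.bam
  assert_nonempty output/test1/sample1/TESTX_S1_L001.preprocessed.bai
else
  echo "nextflow not found; skipping the pipeline run"
fi
```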

README.md

Lines changed: 21 additions & 8 deletions
```diff
@@ -1,25 +1,34 @@
 # TronFlow BAM preprocessing pipeline
 
 ![GitHub tag (latest SemVer)](https://img.shields.io/github/v/release/tron-bioinformatics/tronflow-bam-preprocessing?sort=semver)
+[![Automated tests](https://github.com/TRON-Bioinformatics/tronflow-bam-preprocessing/actions/workflows/automated_tests.yml/badge.svg)](https://github.com/TRON-Bioinformatics/tronflow-bam-preprocessing/actions/workflows/automated_tests.yml)
 [![DOI](https://zenodo.org/badge/358400957.svg)](https://zenodo.org/badge/latestdoi/358400957)
 [![License](https://img.shields.io/badge/license-MIT-green)](https://opensource.org/licenses/MIT)
 [![Powered by Nextflow](https://img.shields.io/badge/powered%20by-Nextflow-orange.svg?style=flat&colorA=E1523D&colorB=007D8A)](https://www.nextflow.io/)
 
+The TronFlow BAM preprocessing pipeline is part of a collection of computational workflows for tumor-normal pair
+somatic variant calling. These workflows are implemented in the Nextflow (Di Tommaso, 2017) framework.
 
+Find the documentation here [![Documentation Status](https://readthedocs.org/projects/tronflow-docs/badge/?version=latest)](https://tronflow-docs.readthedocs.io/en/latest/?badge=latest)
 
-Nextflow (Di Tommaso, 2017) pipeline for the preprocessing of BAM files based on Picard and GATK (DePristo, 2011).
+
+The aim of this workflow is to preprocess BAM files based on Picard and GATK (DePristo, 2011) best practices.
 
 
 ## Background
 
-In order to have a variant calling ready BAM file there are a number of operations that need to be applied on the BAM. This pipeline depends on the particular variant caller, but there are some common operations.
+In order to have a variant calling ready BAM file there are a number of operations that need to be applied on the BAM.
+This pipeline depends on the particular variant caller, but there are some common operations.
 
-GATK has been providing a well known best practices document on BAM preprocessing, the latest best practices for GATK4 (https://software.broadinstitute.org/gatk/best-practices/workflow?id=11165) does not perform anymore realignment around indels as opposed to best practices for GATK3 (https://software.broadinstitute.org/gatk/documentation/article?id=3238). This pipeline is based on both Picard and GATK. These best practices have been implemented a number of times, see for instance this implementation in Workflow Definition Language https://github.com/gatk-workflows/gatk4-data-processing/blob/master/processing-for-variant-discovery-gatk4.wdl.
+GATK has been providing a well known best practices document on BAM preprocessing, the latest best practices for
+GATK4 (https://software.broadinstitute.org/gatk/best-practices/workflow?id=11165) does not perform anymore realignment around indels as opposed to best practices for GATK3 (https://software.broadinstitute.org/gatk/documentation/article?id=3238). This pipeline is based on both Picard and GATK. These best practices have been implemented a number of times, see for instance this implementation in Workflow Definition Language https://github.com/gatk-workflows/gatk4-data-processing/blob/master/processing-for-variant-discovery-gatk4.wdl.
 
 
 ## Objectives
 
-We aim at providing a single implementation of the BAM preprocessing pipeline that can be used across different situations. For this purpose there are some required steps and some optional steps. This is implemented as a Nextflow pipeline to simplify parallelization of execution in the cluster.
+We aim at providing a single implementation of the BAM preprocessing pipeline that can be used across different
+use cases.
+For this purpose there are some required steps and some optional steps.
 
 The input can be either a tab-separated values file (`--input_files`) where each line corresponds to one input BAM or a single BAM (`--input_bam` and `--input_name`).
 
```
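The expected column layout of the `--input_files` table is not shown in this diff; judging from the test output paths above (`output/test1/sample1/TESTX_S1_L001.preprocessed.bam`), a plausible sketch is one sample name and one BAM path per line, tab-separated (a guess, not taken from the repository):

```
sample1	/data/TESTX_S1_L001.bam
sample2	/data/TESTX_S1_L002.bam
```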

```diff
@@ -33,7 +42,7 @@ Steps:
 * **Mark duplicates** (optional). Identify the PCR and the optical duplications and marks those reads. This uses the parallelized version on Spark, it is reported to scale linearly up to 16 CPUs.
 * **Realignment around indels** (optional). This procedure is important for locus based variant callers, but for any variant caller doing haplotype assembly it is not needed. This is computing intensive as it first finds regions for realignment where there are indication of indels and then it performs a local realignment over those regions. Implemented in GATK3, deprecated in GATK4
 * **Base Quality Score Recalibration (BQSR)** (optional). It aims at correcting systematic errors in the sequencer when assigning the base call quality errors, as these scores are used by variant callers it improves variant calling in some situations. Implemented in GATK4
-* **Metrics** (optional). A number of metrics are obtained over the BAM file with Picard's CollectMetrics (eg: duplication, insert size, alignment, etc.).
+* **Metrics** (optional). A number of metrics are obtained from the BAM file with Picard's CollectMetrics, CollectHsMetrics and samtools' coverage and depth.
 
 ![Pipeline](figures/bam_preprocessing2.png)
```
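The BQSR step described in the bullets above follows GATK4's standard two-pass pattern (recalibration table first, then applying it); a generic sketch only, with placeholder file paths, since the pipeline's actual command lines are not shown in this diff:

```shell
# Generic GATK4 BQSR pattern behind the step described above; a sketch,
# not the pipeline's actual command (which this diff does not show).
run_bqsr() {
  gatk BaseRecalibrator \
      --input sample.bam \
      --reference reference.fasta \
      --known-sites dbsnp.vcf.gz \
      --output recalibration.table
  gatk ApplyBQSR \
      --input sample.bam \
      --bqsr-recal-file recalibration.table \
      --output sample.recalibrated.bam
}

if command -v gatk > /dev/null 2>&1; then
  run_bqsr
else
  echo "gatk not on PATH; commands shown for reference only"
fi
```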

```diff
@@ -44,8 +53,9 @@ Base Quality Score Recalibration (BQSR) requires dbSNP to avoid extracting error
 Realignment around indels requires a set of known indels (`--known_indels1` and `--known_indels2`).
 These resources can be fetched from the GATK bundle https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle.
 
-Optionally, in order to run Picard's CollectHsMetrics an intervals file will need to be provided (`--intervals`).
-This can be built from a BED file using Picard's BedToIntervalList (https://gatk.broadinstitute.org/hc/en-us/articles/360036883931-BedToIntervalList-Picard-)
+Optionally, in order to run Picard's CollectHsMetrics a BED file will need to be provided (`--intervals`).
+This BED file will also be used for `samtools coverage`.
+
 
 ## How to run it
```
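Wiring the optional resources above together, a hypothetical invocation might look as follows. The flag names are taken from the Makefile tests in this commit; the file paths are placeholders, and the dbSNP flag is omitted because it does not appear anywhere in this diff:

```shell
# Hypothetical pipeline invocation with the optional resources described above.
# Flag names come from the Makefile tests in this commit; paths are placeholders.
run_pipeline() {
  nextflow main.nf \
      -profile conda \
      --input_files input_files.tsv \
      --intervals capture_targets.bed \
      --known_indels1 known_indels_1.vcf.gz \
      --known_indels2 known_indels_2.vcf.gz \
      --output output
}

if command -v nextflow > /dev/null 2>&1 && [ -f main.nf ]; then
  run_pipeline
else
  echo "nextflow and/or main.nf not available; command shown for reference only"
fi
```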

````diff
@@ -103,8 +113,11 @@ Computational resources:
 
 Optional output:
 * Recalibration report
+* Deduplication metrics
 * Realignment intervals
-* Metrics
+* GATK multiple metrics
+* HS metrics
+* Horizontal and vertical coverage metrics
 ```
 
 
````

environment.yml

Lines changed: 0 additions & 10 deletions
This file was deleted.
