Skip to content

Commit e154168

Browse files
committed
added 1) bed->bigBed conversion, 2) QC spreadsheet generation, 3) ENCODE accession data
1 parent 44ad4e5 commit e154168

20 files changed

+864
-31
lines changed

README.md

Lines changed: 116 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -361,7 +361,122 @@ There are two kinds of HTML reports provided by the pipeline.
361361
362362
If you stop a BDS pipeline with `Ctrl+C` while calling peaks with `spp`. Temporary files generated by `Rscript` are not removed and they are still on `$TMP` (or `/tmp` if not explicitly exported). You need to manually remove them.
363363
364-
## Programming with BDS
364+
365+
# Output directory structure and file naming
366+
367+
For more details, refer to the file table section in an HTML report generated by the pipeline.
368+
369+
```
370+
out # root dir. of outputs
371+
372+
*report.html # HTML report
373+
*tracks.json # Tracks datahub (JSON) for WashU browser
374+
├ ENCODE_summary.json # Metadata of all datafiles and QC results
375+
376+
├ align # mapped alignments
377+
│ ├ rep1 # for true replicate 1
378+
│ │ ├ *.bam # raw bam
379+
│ │ ├ *.nodup.bam # filtered and deduped bam
380+
│ │ ├ *.tagAlign.gz # tagAlign (bed6) generated from filtered bam
381+
│ │ ├ *.*M.tagAlign.gz # subsampled tagAlign for cross-corr. analysis
382+
│ ├ rep2 # for true repilicate 2
383+
│ ...
384+
│ ├ ctl1 # for control 1
385+
│ ...
386+
│ ├ pooled_rep # for pooled replicate
387+
│ ├ pseudo_reps # for self pseudo replicates
388+
│ │ ├ rep1 # for replicate 1
389+
│ │ │ ├ pr1 # for self pseudo replicate 1 of replicate 1
390+
│ │ │ └ pr2 # for self pseudo replicate 2 of replicate 1
391+
│ │ ├ rep2 # for repilicate 2
392+
│ │ ...
393+
│ └ pooled_pseudo_reps # for pooled pseudo replicates
394+
│ ├ ppr1 # for pooled pseudo replicate 1 (rep1-pr1 + rep2-pr1 + ...)
395+
│ └ ppr2 # for pooled pseudo replicate 2 (rep1-pr2 + rep2-pr2 + ...)
396+
397+
├ peak # peaks called
398+
│ ├ macs2 # peaks generated by MACS2
399+
│ │ ├ rep1 # for replicate 1
400+
│ │ │ ├ *.narrowPeak.gz # narrowPeak
401+
│ │ │ ├ *.gappedPeak.gz # gappedPeak
402+
│ │ │ ├ *.filt.narrowPeak.gz # blacklist filtered narrowPeak
403+
│ │ │ ├ *.filt.gappedPeak.gz # blacklist filtered gappedPeak
404+
│ │ ├ rep2 # for replicate 2
405+
│ │ ...
406+
│ │ ├ pseudo_reps # for self pseudo replicates
407+
│ │ └ pooled_pseudo_reps # for pooled pseudo replicates
408+
│ │
409+
│ ├ spp # peaks generated by SPP
410+
│ │ ├ rep1 # for replicate 1
411+
│ │ │ ├ *.regionPeak.gz # regionPeak (narrowPeak format) generated from SPP
412+
│ │ │ ├ *.filt.regionPeak.gz # blacklist filtered narrowPeak
413+
│ │ ├ rep2 # for replicate 2
414+
│ │ ...
415+
│ │ ├ pseudo_reps # for self pseudo replicates
416+
│ │ └ pooled_pseudo_reps # for pooled pseudo replicates
417+
│ │
418+
│ └ idr # IDR thresholded peaks (using peaks from SPP)
419+
│ ├ true_reps # for replicate 1
420+
│ │ ├ *.narrowPeak.gz # IDR thresholded narrowPeak
421+
│ │ ├ *.filt.narrowPeak.gz # IDR thresholded narrowPeak (blacklist filtered)
422+
│ │ └ *.12-col.bed.gz # IDR thresholded narrowPeak track for WashU browser
423+
│ ├ pseudo_reps # for self pseudo replicates
424+
│ │ ├ rep1 # for replicate 1
425+
│ │ ...
426+
│ ├ optimal_set # optimal IDR thresholded peaks
427+
│ │ └ *.filt.narrowPeak.gz # IDR thresholded narrowPeak (blacklist filtered)
428+
│ ├ conservative_set # optimal IDR thresholded peaks
429+
│ │ └ *.filt.narrowPeak.gz # IDR thresholded narrowPeak (blacklist filtered)
430+
│ ├ pseudo_reps # for self pseudo replicates
431+
│ └ pooled_pseudo_reps # for pooled pseudo replicate
432+
433+
├ qc # QC logs
434+
│ ├ *IDR_final.qc # Final IDR QC
435+
│ ├ rep1 # for true replicate 1
436+
│ │ ├ *.flagstat.qc # Flagstat QC for raw bam
437+
│ │ ├ *.dup.qc # Picard (or sambamba) MarkDuplicate QC log
438+
│ │ ├ *.pbc.qc # PBC QC
439+
│ │ ├ *.nodup.flagstat.qc # Flagstat QC for filtered bam
440+
│ │ ├ *M.cc.qc # Cross-correlation analysis score for tagAlign
441+
│ │ └ *M.cc.plot.pdf/png # Cross-correlation analysis plot for tagAlign
442+
│ ...
443+
444+
├ signal # signal tracks
445+
│ ├ macs2 # signal tracks generated by MACS2
446+
│ │ ├ rep1 # for true replicate 1
447+
│ │ │ ├ *.pval.signal.bigwig (E) # signal track for p-val
448+
│ │ │ └ *.fc.signal.bigwig (E) # signal track for fold change
449+
│ ...
450+
│ └ pooled_rep # for pooled replicate
451+
452+
└ report # files for HTML report
453+
```
454+
455+
## QC metrics spreadsheet (TSV) generation
456+
457+
For each pipeline rune, `ENCODE_summary.json` file is generated under the output directory (`-out_dir`). This JSON file includes all metadata and QC metrics.
458+
459+
`./utils/parse_summary_qc_recursively.py` recursively finds `ENCODE_summary.json` files and parse them to generate one big TSV spreadsheet for QC metrics.
460+
461+
```
462+
$ python parse_summary_qc_recursively.py -h
463+
usage: ENCODE_summary.json parser for QC [-h] [--out-file OUT_FILE]
464+
[--search-dir SEARCH_DIR]
465+
[--json-file JSON_FILE]
466+
467+
Recursively find ENCODE_summary.json, parse it and make a TSV spreadsheet of
468+
QC metrics.
469+
470+
optional arguments:
471+
-h, --help show this help message and exit
472+
--out-file OUT_FILE Output TSV filename)
473+
--search-dir SEARCH_DIR
474+
Root directory to search for ENCODE_summary.json
475+
--json-file JSON_FILE
476+
Specify json file name to be parsed
477+
```
478+
479+
# Programming with BDS
365480
366481
* [Using genomic pipeline modules in Kundaje lab](https://kundajelab.github.io/bds_pipeline_modules/programming.html)
367482

chipseq.bds

Lines changed: 28 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -35,6 +35,7 @@ include "modules/callpeak_spp.bds"
3535
include "modules/callpeak_macs2.bds"
3636
include "modules/callpeak_naive_overlap.bds"
3737
include "modules/callpeak_idr.bds"
38+
include "modules/callpeak_blacklist_filter.bds"
3839

3940
include "modules/signal.bds"
4041

@@ -89,6 +90,9 @@ void main() { // chipseq pipeline starts here
8990

9091
create_sig_trk() // (BETA)
9192

93+
// blacklist-filter peaks
94+
filter_peak()
95+
9296
report()
9397
}
9498

@@ -1237,6 +1241,27 @@ void create_sig_trk( int rep ) {
12371241
}
12381242
}
12391243

1244+
// black list filter and then convert to bigbed (for true replicates only)
1245+
void filter_peak() {
1246+
1247+
// peaks for true replicates
1248+
pc := get_peak_caller()
1249+
if ( get_num_rep() > 1 && peak_pooled.hasKey(pc) ) {
1250+
filt_peak_pooled := \
1251+
blacklist_filter_peak( "narrowPeak", peak_pooled{pc}, (peak_pooled{pc}).dirName(), "peak_pooled" )
1252+
}
1253+
1254+
for (int rep=1; rep<=get_num_rep(); rep++) {
1255+
if ( !peak.hasKey(pc+",$rep") ) continue
1256+
filt_peak := \
1257+
blacklist_filter_peak( "narrowPeak", peak{pc+",$rep"}, (peak{pc+",$rep"}).dirName(), "peak $rep" )
1258+
}
1259+
1260+
wait
1261+
1262+
print( "\n== Done filter_peak_and_convert_to_bigbed()\n" )
1263+
}
1264+
12401265
void report() {
12411266

12421267
wait
@@ -1248,6 +1273,7 @@ void report() {
12481273
html += html_chipseq_QC() // show QC tables and images
12491274

12501275
report( html )
1276+
write_summary_json()
12511277

12521278
print( "\n== Done report()\n" )
12531279
}
@@ -1292,9 +1318,9 @@ string html_chipseq_QC() {
12921318

12931319
html := "<div id='chipseq_qc'>"
12941320

1295-
html += parse_flagstat_to_html( "all", flagstat_headers, flagstat_qcs, flagstat_headers )
1321+
html += parse_flagstat_to_html( "all", flagstat_headers, flagstat_qcs, flagstat_headers, false )
12961322
html += parse_dup_to_html( "all", dup_headers, dup_qcs, dup_headers )
1297-
html += parse_flagstat_to_html( "all, filtered",flagstat_nodup_headers, flagstat_nodup_qcs, flagstat_nodup_headers )
1323+
html += parse_flagstat_to_html( "all, filtered",flagstat_nodup_headers, flagstat_nodup_qcs, flagstat_nodup_headers, true )
12981324
html += parse_pbc_to_html( "all", pbc_headers, pbc_qcs, pbc_headers )
12991325
html += parse_xcor_to_html( "all", xcor_headers, xcor_qcs, xcor_plots, xcor_headers )
13001326

etc/broadPeak.as

Lines changed: 13 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,13 @@
1+
table broadPeak
2+
"BED6+3 Peaks of signal enrichment based on pooled, normalized (interpreted) data."
3+
(
4+
string chrom; "Reference sequence chromosome or scaffold"
5+
uint chromStart; "Start position in chromosome"
6+
uint chromEnd; "End position in chromosome"
7+
string name; "Name given to a region (preferably unique). Use . if no name is assigned."
8+
uint score; "Indicates how dark the peak will be displayed in the browser (0-1000)"
9+
char[1] strand; "+ or - or . for unknown"
10+
float signalValue; "Measurement of average enrichment for the region"
11+
float pValue; "Statistical significance of signal value (-log10). Set to -1 if not used."
12+
float qValue; "Statistical significance with multiple-test correction applied (FDR -log10). Set to -1 if not used."
13+
)

etc/gappedPeak.as

Lines changed: 19 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,19 @@
1+
table gappedPeak
2+
"This format is used to provide called regions of signal enrichment based on pooled, normalized (interpreted) data where the regions may be spliced or incorporate gaps in the genomic sequence. It is a BED12+3 format."
3+
(
4+
string chrom; "Reference sequence chromosome or scaffold"
5+
uint chromStart; "Pseudogene alignment start position"
6+
uint chromEnd; "Pseudogene alignment end position"
7+
string name; "Name of pseudogene"
8+
uint score; "Score of pseudogene with gene (0-1000)"
9+
char[1] strand; "+ or - or . for unknown"
10+
uint thickStart; "Start of where display should be thick (start codon)"
11+
uint thickEnd; "End of where display should be thick (stop codon)"
12+
uint reserved; "Always zero for now"
13+
int blockCount; "Number of blocks"
14+
int[blockCount] blockSizes; "Comma separated list of block sizes"
15+
int[blockCount] chromStarts; "Start positions relative to chromStart"
16+
float signalValue; "Measurement of average enrichment for the region"
17+
float pValue; "Statistical significance of signal value (-log10). Set to -1 if not used."
18+
float qValue; "Statistical significance with multiple-test correction applied (FDR). Set to -1 if not used."
19+
)

etc/narrowPeak.as

Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,14 @@
1+
table narrowPeak
2+
"BED6+4 Peaks of signal enrichment based on pooled, normalized (interpreted) data."
3+
(
4+
string chrom; "Reference sequence chromosome or scaffold"
5+
uint chromStart; "Start position in chromosome"
6+
uint chromEnd; "End position in chromosome"
7+
string name; "Name given to a region (preferably unique). Use . if no name is assigned"
8+
uint score; "Indicates how dark the peak will be displayed in the browser (0-1000) "
9+
char[1] strand; "+ or - or . for unknown"
10+
float signalValue; "Measurement of average enrichment for the region"
11+
float pValue; "Statistical significance of signal value (-log10). Set to -1 if not used."
12+
float qValue; "Statistical significance with multiple-test correction applied (FDR -log10). Set to -1 if not used."
13+
int peak; "Point-source called for this peak; 0-based offset from chromStart. Set to -1 if no point-source called."
14+
)

0 commit comments

Comments
 (0)