This code is designed to take fastq files that are produced by the MiSeq and return integration sites, multihits, and chimeras. RData and fasta files are used for data persistance in place of a database, thus allowing massive parallelization and the LSF job submission system on the PMACS HPC
Analysis is started by having the user create the following directory structure:
primaryAnalysisDirectory
├── Data
│ ├── Undetermined_S0_L001_I1_001.fastq.gz
│ ├── Undetermined_S0_L001_R1_001.fastq.gz
│ └── Undetermined_S0_L001_R2_001.fastq.gz
├── processingParams.tsv
├── sampleInfo.tsv
└── vector.fasta
-
Data/Undetermined_S0_L001_*_001.fastq.gzare the fastq files returned by the MiSeq (R1, R2, and I1) At present names are hard-coded and * can only beI1,R1andR2 -
Optional
processingParams.tsvcontains 'dryside' processing parameters, all the same for all samples:-
qualityThreshold -
badQualityBases -
qualitySlidingWindow -
mingDNAis the minimum length of genomic DNA (in nt) that is allowed to be passed onto the alignment step -
minPctIdentis the minimum percent identity that a query sequence has to a putative alignment on the target sequence in order to be considered 'valid' -
maxAlignStartis the maximum number of nucleotides of the query sequence that do not match the putative target sequence before a matched nucleotide is seen -
maxFragLengthis the maximum length of a properly-paired alignment (in nt) -
refGenomeis the reference genome to be used for alignment - this is passed in as a standard text string (ex. 'hg18', 'hg19', 'mm8', 'mm9')After error-correcting and demultiplexing, intSiteCaller trims raw MiSeq reads based on Illumina-issued quality scores.
badQualityBasesis the number of bases belowqualityThreshold, using standard Illumina ASCII Q-score encoding (p.41-2), that can be observed in a window ofqualitySlidingWindowbefore the read is trimmed.
Default file
default_processingParams.tsvis used whenprocessingParams.tsvis not found in the folder. -
-
Required
sampleInfo.tsvcontains 'wetside' sample metadata:aliasis the human-readable sample descriptionlinkerSequenceis the linker sequence as seen in MiSeq read 1. N's indicate the presence of a primerIDbcSeqis the barcode used during sample preparationgenderis either 'm' of 'f' for male/female, respectivelyprimeris the primer sequence as seen in MiSeq read 2ltrBitis the LTR sequence as seen in MiSeq read 2largeLTRFragis 43nt of the LTR sequence as seen from MiSeq read 1vectorSeqis a filepath (either absolute or relative to the primary analysis directory) to the vector sequence in fasta format -- it is encouraged to place the vector sequence directly in the primary analysis directory, although that is not a requirement
-
Required
vector.fastavector sequence file as specified byvectorSeqin sampleInfo.tsv -
make_primaryAnalysisDirectoryByRunId.Rwill generate the directory automatically.
If we need to proccess multiple genomes on the same run we can use non-standard sampleInfo.tsv
and non-standard processingParams.tsv,
we add column refGenome to sampleInfo.tsv and remove this column from processingParams.tsv.
After creating the above directory structure and cd primaryAnalysisDirectory, the following command is issued:
Rscript path/to/intSiteCaller.R
The rest of the processing is fully automated and shouldn't take more than 4 hours to process 1.5e7 raw reads.
After intSiteCaller.R is done, one can examine the attrition table by the command:
Rscript path/to/check_stats.R
and the output is a tab delimited summary table describing each step.
intSiteCaller.R can handle the following optional arguments
-j,--jobID- Unique name by which to identify this intance of intSiteCaller [default: intSiteCallerJob]-c,--codeDir- Directory where intSiteCaller code is stored, can be relative or absolute [default: codeDir as detected by Rscript]-p,--primaryAnalysisDir- Location of primary analysis directory, can be relative or absolute [default: .]-h,--help- Show the help message and exit
- We assume packages intSiteCaller, intSiteUploader, geneTherapyPatientReportMaker installed in the $HOME directory.
- We start by creating a folder
Frances/run20150505and movingGeneTherapy-20150505-sampleInfo.csvto that folder. - Note that the miseq run date
150505match between the folder name and the csv file for consistence.
#1. Prepare the structure of primaryAnalysisDirectory, assuming GeneTherapy-20150505-sampleInfo.csv exists
cd Frances/run20150505
Rscript ~/intSiteCaller/make_primaryAnalysisDirectory.R
#2. Align reads and call sites; wait until all bjobs are done; check exit status
Rscript ~/intSiteCaller/intSiteCaller.R
grep -i exit logs/*.txt #a good run returns nothing
grep -i error logs/*.txt #try to distinguish code error or warning
#3. Check attrition table, make sure the numbers are reasonable and the html or pdf file looks OK
Rscript ~/intSiteCaller/check_stats.R | cut -f1-20
Rscript ~/intSiteCaller/html_stats.R
#4. Upload to database
Rscript ~/intSiteUploader/intSiteUploader.R
#5. Check GTSP numbers, find patient metadate for this run, in this example,
# check_gtsp_patient.R shows the run was for pFR03.
# check_patient_gtsp.R pFR03 will give us all the sets saved in the database for pFR03
# and we save that information as input file to generate report.
Rscript ~/geneTherapyPatientReportMaker/check_gtsp_patient.R #check patient info for this run
Rscript ~/geneTherapyPatientReportMaker/check_patient_gtsp.R #check all patients
Rscript ~/geneTherapyPatientReportMaker/check_patient_gtsp.R pFR03 #output to screen
Rscript ~/geneTherapyPatientReportMaker/check_patient_gtsp.R pFR03 > pFR03.csv #dump to file
#6. Make report for pFR03
Rscript ~/geneTherapyPatientReportMaker/makeGeneTherapyPatientReport.R pFR03.csv
#7. WAS.pFR03.20150617.html will be generated. Today was 20150617.
# If there are more patients in a run, repeat steps 5 and 6.
#8. Generate genomic heatmap
# To be added
#9. The run folder minus the intermediate files should be saved in a central folder for permanant storage.
# To be developed
#10. After all the above steps, the folder run20150505 can be deleted.
#11. Generate UCSC hub
# To be developed
We can remove all files, except Rdata and logs with the script:
Rscript clean_primaryAnalysisDirectory.R
the script asks for confirmation.
This code returns integration sites in two formats. allSites.RData is a
GRanges object that contains a single record for each Illumina read.
sites.final.RData is a GRanges object of dereplicated integration sites
along with a tally of how many reads were seen for each site (irregardless of
sonic breakpoint). The revmap column in sites.final.RData links unique
reads from allSites to dereplicated sites in sites.final.
Multihits are stored in multihitData.RData which is a GRangesList. The
first item in this list is a GRanges object where each record represents a
properly-paired alignment. Individual multihit reads can be identified by
analysing the ID column, which cooresponds to the unique Illumina read
identifier. The second item in multihitData is a GRanges object of
dereplicated multihits, which lists each unique genomic integration site as a
unique record. The revmap column pairs records from multihitData[[1]] to
multihitData[[2]]. The third item is a GRanges object of multihit
clusters. This is still in development.
Chimeras are stored in chimeraData.RData which is a list that contains some
basic chimera frequency statistics and a GRangesList object. Each GRanges
object contains two records, one for the read1 alignment and another for the
read2 alignment
PrimerIDs (if present in the linker sequence) are stored in
primerIDData.RData. This file is a base R list containing a DNAStringSet
and a BStringSet containing the sequences and quality scores of the
primerIDs.
Processing statistics are returned in the stats.RData file. This file
contains a single data.frame. Detail of the specific columns provided in
this dataframe will be added later and can inferred from intSiteLogic.R.
This code is highly dependent on Bioconductor packages for processing DNA data and collapsing/expanding alignments.
The following R packages and their subsesequent dependencies are required for proper operation of intSiteCaller:
ShortReadGenomicRangesrtracklayerBSgenomeargparseigraphBSgenome.*.UCSC.*package cooresponding to reference genomes specified inprocessingParams.csv
Specific versioning analysis has not yet been performed.
Additionally, blat and python are required and must be executable from any path.
blat is available from https://genome.ucsc.edu/FAQ/FAQblat.html#blat3
intSiteCaller confirms the presence of all dependancies and will throw an error if a dependancy is not met.
- Primary read trimming and integration site calling logic is contained in
intSiteLogic.R. - Branching and condensing of PMACS jobs is performed in
programFlow.R - Barcode error correcting logic is performed in
errorCorrectIndices/golay.pyas wrapped byerrorCorrectIndices/processGolay.py. - All code files are checked into the repository.
- Flowcharts will be added to graphically describe the flow of the overall program as well as the flow/logic of individual functions
A sample dataset is included for verification of integration site calling accuracy. The testCases directory contains,
intSiteValidationfolder, which includes the minimal number of files to process a test run,intSiteValidation.digest, a digest(R version of md5) file for theRDatafiles that the test run would produce,intSiteValidation.attr, an attrition table that describes the filtering and alignment process,test_identical_run.R, the script to run the piepline and check the output.
To analyze the test data, run the following commands assuming the current directory is the root of the repository,
cd testCases/intSiteValidation/
Rscript test_identical_run.R
The test should finish in 10 minutes if PMACS is not busy and the output messages should tell whether the pipeline produced the same results as before. Note that this subset of data contains samples with some borderline cases. For example, clone7 samples should all fail, and many of the clone1-clone4 samples should return no multihits or chimeras. The current implementation of the code handles these gracefully.
Run unit tests with:
library(testthat)
test_dir('tests')