-
Notifications
You must be signed in to change notification settings - Fork 6
Description
On line 476 of intSiteLogic.R, we've assigned the "tStart" and "tEnd" from the alignment output to the GRanges start and end positions. This is correct most of the time, but the "tStart" and "tEnd" in the psl output file correspond to the edges of the alignment. If there are bases at the beginning and end of the read that do not align, they are not considered when assigning the position.
I think this is a mistake in logic. Instead, the range assignment should be incorporate the flanking nucleotides that may not align, especially on the LTR end. This region is known to have a drop in sequence quality that can lead to base miscalling. These miscalled bases may not be aligned which will change the position of the integration site position calling. A quality metric used for filtering is that the "qStart" must be less than or equal to 5nt (line 519, adjustible with maxAlignStart), which means that the called position could be up to 5 nt away from the actual junction of the LTR.
This is fixed downstream in our reports by standardizing within a 5bp window, but could be refined in the current informatic processing. The proposed change would look something like below:
intSiteLogic.R : Line 476
ranges=IRanges(start=algns$tStart, end=algns$tEnd)
changed to:
ranges = ifelse(algns$strand == "+",
IRanges(
start = (algns$tStart - algns$qStart + 1),
end = (algns$end + (algns$qSize - algns$qEnd))),
IRanges(
start = (algns$tStart - (algns$qSize - algns$qEnd)),
end = (algns$tEnd + algns$qStart - 1)))