Skip to content

Miscalling integration site and breakpoint positions on genome #65

@cnobles

Description

@cnobles

On line 476 of intSiteLogic.R, we've assigned the "tStart" and "tEnd" from the alignment output to the GRanges start and end positions. This is correct most of the time, but the "tStart" and "tEnd" in the psl output file correspond to the edges of the alignment. If there are bases at the beginning and end of the read that do not align, they are not considered when assigning the position.

I think this is a mistake in logic. Instead, the range assignment should be incorporate the flanking nucleotides that may not align, especially on the LTR end. This region is known to have a drop in sequence quality that can lead to base miscalling. These miscalled bases may not be aligned which will change the position of the integration site position calling. A quality metric used for filtering is that the "qStart" must be less than or equal to 5nt (line 519, adjustible with maxAlignStart), which means that the called position could be up to 5 nt away from the actual junction of the LTR.

This is fixed downstream in our reports by standardizing within a 5bp window, but could be refined in the current informatic processing. The proposed change would look something like below:

intSiteLogic.R : Line 476

    ranges=IRanges(start=algns$tStart, end=algns$tEnd)

changed to:

    ranges = ifelse(algns$strand == "+",
        IRanges(
            start = (algns$tStart - algns$qStart + 1),
            end = (algns$end + (algns$qSize - algns$qEnd))),
        IRanges(
            start = (algns$tStart - (algns$qSize - algns$qEnd)),
            end = (algns$tEnd + algns$qStart - 1)))

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions