Question

cluster (only one end overlapping) breakpoints using InteractionSet

tangming2005 ▴ 190

Hi there,
Thanks for this package. I have been using it to cluster my breakpoints, and I wrote up my notes here: http://crazyhottommy.blogspot.com/2016/03/breakpoints-clustering-for-structural.html
In that post, I used your method to cluster breakpoints that have both ends overlapping.

Now I have another question:

----|------|-------|------|---------------------------------------------------- 
       A                B
---------------------|--------|----------|-------|----------------------------- 
                         C                    D
--------------------------------------------|---------|------------|------|----  
                                                 E                     F

I have a GInteractions object. Among the pairs, breakpoint B overlaps with C, and D overlaps with E. I want to group these three pairs together and assign the same ID to them, so that I know they belong to one complex rearrangement event.

In the toy example below, the first three pairs should be grouped into one cluster:

library(InteractionSet)
all.regions <- GRanges(rep("chrA",8), IRanges(c(1,4,5,9,10,15,20,22), c(3,6,7,11,13,19,25,27)))
index.1 <- c(1,3,5,7)
index.2 <- c(2,4,6,8) 

gi <- GInteractions(index.1, index.2, all.regions, mode ="strict")

gi

Thanks,
Ming

InteractionSet structural variants • 1.0k views

updated 7.7 years ago by Aaron Lun ★ 28k • written 7.7 years ago by tangming2005 ▴ 190

score 2 · Accepted Answer · 2016-07-29

Continuing from your example above:

olap1 <- findOverlaps(anchors(gi, "first"), gi) # overlaps with first region
olap2 <- findOverlaps(anchors(gi, "second"), gi) # overlaps with second region
olap <- unique(Hits(c(queryHits(olap1), queryHits(olap2)),
    c(subjectHits(olap1), subjectHits(olap2)),
    length(gi), length(gi), sort.by.query=TRUE)) # combined overlaps

The olap object contains all pairs of entries in gi that share one or more overlapping anchor regions. This can then be used to construct a graph as described in C: manipulate bedpe format files, with clustering performed by identifying all connected nodes in the graph. Each connected component corresponds to one cluster of pairs.
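The graph step could look roughly like the following. This is a minimal sketch that assumes the igraph package is installed (the linked post may well use a different graph package), and the metadata column name `cluster` is chosen here purely for illustration.

```r
library(InteractionSet)

# Toy data from the question: four pairs, the first three chained by overlaps.
all.regions <- GRanges(rep("chrA", 8),
    IRanges(c(1,4,5,9,10,15,20,22), c(3,6,7,11,13,19,25,27)))
gi <- GInteractions(c(1,3,5,7), c(2,4,6,8), all.regions, mode="strict")

# Combined overlaps, computed as in the answer.
olap1 <- findOverlaps(anchors(gi, "first"), gi)
olap2 <- findOverlaps(anchors(gi, "second"), gi)
olap <- unique(Hits(c(queryHits(olap1), queryHits(olap2)),
    c(subjectHits(olap1), subjectHits(olap2)),
    length(gi), length(gi), sort.by.query=TRUE))

# One node per pair in gi; one undirected edge per overlapping pair of entries.
# igraph functions are called with igraph:: to avoid masking Bioconductor generics.
g <- igraph::make_empty_graph(n=length(gi), directed=FALSE)
g <- igraph::add_edges(g, as.vector(rbind(queryHits(olap), subjectHits(olap))))

# Each connected component is one cluster; store its ID alongside each pair.
cluster.id <- igraph::components(g)$membership
mcols(gi)$cluster <- cluster.id
gi
```

For the toy data, the first three pairs end up with the same cluster ID and the fourth pair gets its own, because it overlaps none of the others.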

Note that greater efficiency can be obtained by overlapping regions(gi) with itself and then expanding those overlaps to the pairs via the anchor IDs (i.e., using anchors with id=TRUE). This avoids expanding the GRanges when calling anchors in the findOverlaps calls above, which saves memory and time (as duplicated ranges don't have to be overlapped). However, it requires some care, so I would only bother doing it for large objects where speed really matters.
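One possible way to do that expansion is sketched below. It is an assumption about how the idea could be implemented, not the answer author's code: the merge-based join and the names `reg.olap`, `map`, `hits` and `pair.hits` are made up for illustration. Joining through the region-to-pair table means that any unused entries in regions(gi) drop out on their own.

```r
library(InteractionSet)

# Toy data from the question.
all.regions <- GRanges(rep("chrA", 8),
    IRanges(c(1,4,5,9,10,15,20,22), c(3,6,7,11,13,19,25,27)))
gi <- GInteractions(c(1,3,5,7), c(2,4,6,8), all.regions, mode="strict")

# Overlap the unique regions once, instead of the expanded anchors.
reg.olap <- findOverlaps(regions(gi), regions(gi))
hits <- data.frame(region=queryHits(reg.olap), region2=subjectHits(reg.olap))

# Table recording which pair uses which region (as first or second anchor).
a1 <- anchors(gi, type="first", id=TRUE)
a2 <- anchors(gi, type="second", id=TRUE)
map <- data.frame(region=c(a1, a2), pair=rep(seq_along(gi), 2))

# pair -> its regions -> overlapping regions -> pairs using those regions.
m1 <- merge(map, hits, by="region")
m2 <- merge(m1, map, by.x="region2", by.y="region", suffixes=c(".q", ".s"))
pair.hits <- unique(m2[, c("pair.q", "pair.s")])
pair.hits <- pair.hits[order(pair.hits$pair.q, pair.hits$pair.s), ]
pair.hits
```

The two columns of `pair.hits` play the same role as queryHits(olap) and subjectHits(olap), so they can be fed to the same graph-based clustering. The intermediate merge can itself become large when many regions overlap, so it is worth checking on real data that this is actually faster.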