overlapsAny for elements in GRangesList
2
0
Entering edit mode
maltethodberg ▴ 180
@maltethodberg-9690
Last seen 1 day ago
Denmark

I have a GRanges and a GRangesList object, i.e.:

gr # GRanges

grList # GRangesList

I wish to know what ranges in the GRanges are in each of the ranges in each element in the list, so I thought of calling overlapsAny on each element of the GRangesList like so:

overlapsList <- lapply(grList, function(x) overlapsAny(query=gr, subject=x))

This gives me the desired output, however this is very slow!

Is there a faster of doing the same thing?

granges grangeslist overlapsAny • 2.7k views
ADD COMMENT
0
Entering edit mode

Perhaps you could describe your high-level use case / problem?

ADD REPLY
0
Entering edit mode

I'm doing some enrichment analysis, so I need to annotate a "universe" of ranges with a wide range of other annotations.

ADD REPLY
4
Entering edit mode
@herve-pages-1542
Last seen 1 day ago
Seattle, WA, United States

Hi,

Edit: This solution is incorrect.

This is a situation where you can avoid the lapply loop by using the unlist/relist idiom:

unlisted_grList <- unlist(grList, use.names=FALSE)
relist(overlapsAny(query=unlisted_grList, subject=gr), grList)

Will be 100x faster or more than using lapply().

Correct solution:

hits <- findOverlaps(gr, grList)
m <- matrix(FALSE, nrow=queryLength(hits), ncol=subjectLength(hits))
m[as.matrix(hits)] <- TRUE

See comments below for the details.

Cheers,

H.

ADD COMMENT
0
Entering edit mode

Is that equivalent to the code in the question? He wants the query to be gr, and the subject the grList element.

ADD REPLY
0
Entering edit mode

Correct, this is not equivalent to the original example, here you overlap each element of the GRlist with GR, rather than the other way around.

ADD REPLY
2
Entering edit mode

Oops, you guys are right, my bad. Here is my 2nd attempt. The unlist/relist trick is actually not needed:

library(GenomicRanges)
gr <- GRanges(c("chr1:9-14", "chr3:11-16", "chr1:4-5"))
grList <- GRangesList(GRanges("chr1:5-10"), 
                      GRanges(c("chr2:1-4", "chr1:4-5")),
                      GRanges(),
                      GRanges("chr3:15-17"))

lapply solution (slow):

overlapsList <- lapply(grList, function(x) overlapsAny(query=gr, subject=x))
overlapsList
# [[1]]
# [1]  TRUE FALSE  TRUE
# [[2]]
# [1] FALSE FALSE  TRUE
# [[3]]
# [1] FALSE FALSE FALSE
# [[4]]
# [1] FALSE  TRUE FALSE

Fast solution (no-loop):

hits <- findOverlaps(gr, grList)
m <- matrix(FALSE, nrow=queryLength(hits), ncol=subjectLength(hits))
m[as.matrix(hits)] <- TRUE
m
#       [,1]  [,2]  [,3]  [,4]
# [1,]  TRUE FALSE FALSE FALSE
# [2,] FALSE FALSE FALSE  TRUE
# [3,]  TRUE  TRUE FALSE FALSE

The above matrix has 1 row per range in the GRanges object and 1 column per list element in the GRangesList object.

To get the result as a list of indices into the query:

as(t(hits), "List")
# IntegerList of length 4
# [[1]] 1 3
# [[2]] 3
# [[3]] integer(0)
# [[4]] 2

Hopefully this time I got it right...

Cheers,

H.

PS: I edited my initial answer above accordingly.

ADD REPLY
0
Entering edit mode

Nice solution. This could be implemented as a method on an adjacencyMatrix() generic, which could be promoted from graph, or graph could enhance S4Vectors.

ADD REPLY
0
Entering edit mode

Nice! I was already playing around with findOverlaps, but this is so much faster!

ADD REPLY
0
Entering edit mode

Hervé Pagès how can i fill the overlaps matrix with scores/values ? ie replace TRUE in overlaps matrix with methylation_level in example granges below.

hits_unlist <- findOverlaps(gr_methylation, grl.2_unlist)
scoreMatrix_TrueFalse <- matrix(FALSE, nrow=queryLength(hits_unlist), ncol=subjectLength(hits_unlist))
scoreMatrix_TrueFalse[as.matrix(hits_unlist)] <- TRUE

#read data
length_gr <- 21
gr_methylation = suppressWarnings(GRanges(
  seqnames = Rle(rep(1, length_gr)),
  ranges = IRanges(start=100:120, end=100:120),
  score = 1:length_gr,
  methylation_level = runif(length_gr),
  GC = seq(1, 0, length=length_gr)))


grl.2_unlist <- structure(new("GRanges", seqnames = new("Rle", values = structure(1L, levels = "1", class = "factor"), 
                                        lengths = 24L, elementMetadata = NULL, metadata = list()), 
              ranges = new("IRanges", start = c(99L, 100L, 101L, 102L, 
                                                102L, 103L, 104L, 105L, 103L, 104L, 105L, 106L, 108L, 109L, 
                                                110L, 111L, 112L, 113L, 114L, 115L, 116L, 117L, 118L, 119L
              ), width = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                           1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), NAMES = NULL, 
              elementType = "ANY", elementMetadata = NULL, metadata = list()), 
              strand = new("Rle", values = structure(3L, levels = c("+", 
                                                                    "-", "*"), class = "factor"), lengths = 24L, elementMetadata = NULL, 
                           metadata = list()), seqinfo = new("Seqinfo", seqnames = "1", 
                                                             seqlengths = NA_integer_, is_circular = NA, genome = NA_character_), 
              elementMetadata = new("DFrame", rownames = NULL, nrows = 24L, 
                                    elementType = "ANY", elementMetadata = NULL, metadata = list(), 
                                    listData = list(GeneName = c("GeneA", "GeneA", "GeneA", 
                                                                 "GeneA", "GeneB", "GeneB", "GeneB", "GeneB", "GeneC", 
                                                                 "GeneC", "GeneC", "GeneC", "GeneD", "GeneD", "GeneD", 
                                                                 "GeneD", "GeneE", "GeneE", "GeneE", "GeneE", "GeneF", 
                                                                 "GeneF", "GeneF", "GeneF"), grl.ix = c(1L, 1L, 1L, 1L, 
                                                                                                        2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 
                                                                                                        5L, 5L, 6L, 6L, 6L, 6L), grl.iix = c(1L, 2L, 3L, 4L, 
                                                                                                                                             1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 
                                                                                                                                             3L, 4L, 1L, 2L, 3L, 4L))), elementType = "ANY", metadata = list()))

thanks in advance

ADD REPLY
0
Entering edit mode
@michael-lawrence-3846
Last seen 3.0 years ago
United States

Since your GRangesList is short (relative to something like a transcriptome), it is probably much faster to use a SimpleGenomicRangesList instead of GRangesList (a compressed list). See `?GenomicRangesList`.

 

ADD COMMENT

Login before adding your answer.

Traffic: 606 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6