Find overlap for more than 2 GRanges object simultaneously (reproducible example is updated) ?
4
4
Entering edit mode
@jurat-shahidin-9488
Last seen 4.7 years ago
Chicago, IL, USA

Hi,

I know findOverlaps() from GenomicRanges package does pretty good job for finding overlapped regions.I have studied whole vignette of GenomicRanges, BiocParallel pakages. findOverlaps function is very well done, but I need element wise operation across several GRanges objects simultaneously. I come up following reproducible example and put my desire output as well.

Objective: find out overlapped regions across multiple GRanges objects simultanously (a.k.a, element-wise), aim to provide co-localization test and to save discarded enriched regions that was discarded in single trail, but discarded enriched regions will be saved by combining evidence of multiple Chip-seq experiment sample (find out its overlapped regions across across multiple GRanges objects in parallel). 

for example:

Step 1: setup

library(GenomicRanges)
a <- GRanges(
  seqnames=Rle(c("chr1", "chr2", "chr3", "chr4"), c(3, 2, 1, 2)),
  ranges=IRanges(seq(1, by=5, len=8), seq(4, by=5, len=8)),
  rangeName=letters[seq(1:8)],
  score=sample(1:20, 8, replace = FALSE))

b <- GRanges(
  seqnames=Rle(c("chr1", "chr2", "chr3","chr4"), c(4, 1, 2, 1)),
  ranges=IRanges(seq(2, by=6, len=8), seq(5, by=6, len=8)),
  rangeName=letters[seq(1:8)],
  score=sample(1:20, 8, replace = FALSE))

c <- GRanges(
  seqnames=Rle(c("chr1", "chr2", "chr3","chr4"), c(2, 4, 1,1)), 
  ranges=IRanges(seq(4, by=4, len=8), seq(7, by=4, len=8)),
  rangeName=letters[seq(1:8)],
  score=sample(1:15, 8, replace = FALSE))

Step 2: I want to do overlap operation simultaneously such as:

# each iteration, only take one ranges out of multiple ranges in
# querySample (a.k.a, first GRanges objects)

queryRange <- a[1]
ov_b <- b[subjectHits(findOverlaps(a[1], b))]
ov_c <- c[subjectHits(findOverlaps(a[1], c))]

Step 3: then, I want to keep only one overlapped ranges from ov_b, ov_c (if there are multiple intersection were found), such as:

ov_b_keepOne <- ov_b[which.max(ov_b$score)]  # if multiple overlapped
                                             # found, only keep most
                                             # significant one.
ov_c_KeepOne <- ov_c[which.max(ov_c$score)]

Step 4: then, use chisq.test to find out combined pvalue, such as:

comb.pval <- chisq.test(c(queryRange$score, ov_b$score, ov_c$score))$p.value

Step 5: then, given threshold value gamma=1e-8,  (for example, comb.pvalue > gamma), then, execute below:

res <- GRangesList('querySample'=queryRange,
                   'targetSample_1'=ov_b_keepOne,
                   'targetSample_2'=ov_c_KeepOne)

Step 6: finally, this is my expected results :

res.df <- as.data.frame(res)
write.table(res.df, file="Confirmed.bed")

Objective: only take one ranges (a.k.a, each element of GRanges objects) out of multiple ranges, to find its overlapped ranges from multiple GRanges objects simultaneously (seems bplapply function can does this, but I haven't familiar with batch processing in R). I have bottleneck with this problem for a while, still not solved efficiently, any idea or possible approach to save my effort to efficiently develop my packages. Please point me out what should I do? Thanks a lot

Best regards:

Jurat

findoverlaps granges • 9.3k views
ADD COMMENT
4
Entering edit mode
@martin-morgan-1513
Last seen 5 months ago
United States

I updated your question to make the formatting clearer, and to include steps so that we can refer to the individual steps involved.

From the discussion below, an efficient starting point for this work flow seems to be

## helper function: identify the overlapping range satisfying FUN
keepone <- function(gr, hitlist, FUN=which.max) {
    idx0 <- as(FUN(extractList(gr$score, hitlist)), "List")
    idx1 <- unlist(extractList(seq_along(gr), hitlist)[idx0])
    ## FIXME: what about NA's when there are no matching ranges?
    gr[idx1]
}

ab_hitlist <- as(findOverlaps(a, b), "List")
ac_hitlist <- as(findOverlaps(a, c), "List")

test <- (lengths(ab_hitlist) > 0) + (lengths(ac_hitlist) > 0)
keep <- test >= minNumOlp_Required
a = a[keep]
ab_hitlist = ab_hitlist[keep]
ac_hitlist = ac_hitlist[keep]

b_one <- keepone(b, ab_hitlist)
c_one <- keepone(c, ac_hitlist)

comb.pval <-
    mapply(function(a, b, c) chisq.test(c(a, b, c))$p.value,
           a$score, b_one$score, c_one$score)

Updating each step

The place I want to start is with step 2, which I will replace with

ab_hitlist <- as(findOverlaps(a, b), "List")
ac_hitlist <- as(findOverlaps(a, c), "List")

for the reasons below.

I will replace step 4 with

comb.pval  <- mapply(function(a, b, c) {
    chisq.test(c(a, b, c))$p.value
}, a$score, max(extractList(b$score, ab_hitlist)), max(extractList(c$score, ac_hitlist)))

Step 3 is to identify the overlapping range with maximum score. I did this as follows:

keepone <- function(gr, hitlist) {
    idx0 <- as(which.max(extractList(gr$score, hitlist)), "List")
    idx1 <- unlist(extractList(seq_along(gr), hitlist)[idx0])
    ## FIXME: what about NA's when there are no matching ranges?
    gr[idx1]
}
b_one <- keepone(b, ab_hitlist)
c_one <- keepone(c, ac_hitlist)

But now that the specification for step 4 has been clarified as the maximum score of the overlapping ranges, and we have the overlapping range, step 4 is

comb.pval  <- mapply(function(a, b, c) chisq.test(c(a, b, c))$p.value,
                     a$score, b_one$score, c_one$score) 

Justification for updated step 2:

It is very inefficient to calculate overlaps between 1 range and many ranges. So 'invert' the problem and find all overlaps

ab_hit <- findOverlaps(a, b)
ac_hit <- findOverlaps(a, c)

For convenience, split the hits into a list, where each element of the list corresponds to the overlaps for a different query

ab_hitlist <- as(ab_hit, "List")

For your example, this is

> ab_hitlist
IntegerList of length 8
[[1]] 1 2 3 4
[[2]] 1 2 3 4
[[3]] 1 2 3 4
[[4]] 5
[[5]] 5
[[6]] 6 7
[[7]] 8
[[8]] 8

Convince yourself that ab_hitlist[[1]]

> ab_hitlist[[1]]
[1] 1 2 3 4

is the same as subjectHits(findOverlaps(a[1], b)), and so on for a[2], a[3], a[4]. The reason for this change is that taking ranges one-at-a-time scales linearly

> library(microbenchmark)
> f <- function(a, b)
+     lapply(seq_along(a), function(i) subjectHits(findOverlaps(a[i], b)))
> g <- function(a, b)
+     as(findOverlaps(a, b), "List")
> a0 <- a[1:4]; a2 <- c(a, a); a4 <- c(a2, a2); a8 <- c(a4, a4); a16 <- c(a8, a8)
> microbenchmark(f(a0, b), f(a, b), f(a2, b), times=10)
Unit: milliseconds
     expr       min        lq      mean    median        uq       max neval cld
 f(a0, b)  75.21582  76.07495  77.63697  77.76686  78.44264  80.56188    10 a  
  f(a, b) 152.56619 154.25827 157.34376 156.14110 158.94034 169.31483    10  b 
 f(a2, b) 309.08562 309.42790 313.71373 312.01247 316.88869 323.32852    10   c
> microbenchmark(g(a0, b), g(a, b), g(a2, b), g(a4, b), g(a8, b), g(a16, b),
+                times=10)
Unit: milliseconds
      expr      min       lq     mean   median       uq       max neval cld
  g(a0, b) 17.24473 17.32625 17.93150 17.67085 18.01294  20.22646    10   a
   g(a, b) 17.33038 17.50243 18.09587 17.73340 18.19032  20.01569    10   a
  g(a2, b) 20.46892 20.52388 21.30909 21.11872 21.66259  23.64361    10   a
  g(a4, b) 20.51881 20.68018 21.32741 21.17813 21.65589  23.31553    10   a
  g(a8, b) 20.55628 20.64502 21.14069 20.86133 21.33573  22.41963    10   a
 g(a16, b) 20.50725 20.82241 34.97887 21.24104 21.45764 158.24840    10   a

whereas the same is not true for creating a hits object (the scaling is not evident from the small example data).

Justification for updated step 4.

The extractList() function will take a vector-like object as a first argument, and reshape it according to the indexes of the second argument. It is relatively amazingly useful to see that

> extractList(b$score, ab_hitlist)
IntegerList of length 8
[[1]] 12 19 11 6
[[2]] 12 19 11 6
[[3]] 12 19 11 6
[[4]] 7
[[5]] 7
[[6]] 10 20
[[7]] 14
[[8]] 14

extracts the scores and reshapes them into the correct geometry of the hits.

The calculation of p-values applies chisq.test() to the element-wise 'parallel' vectors a$score, and from the discussion in the comments below to max(extractList(b$score, ab_hitlist)), and max(extractList(c$score, ac_hitlist)). So

comb.pval <- mapply(function(a, b, c) {
    chisq.test(c(a, b, c))$p.value
}, a$score, max(extractList(b$score, ab_hitlist)), max(extractList(c$score, ac_hitlist)))

The result is

> comb.pval
[1] 1.380288e-04 1.290620e-04 3.281973e-05 5.494160e-01 4.821936e-01
[6] 1.813352e-01 3.805041e-03 8.590223e-02

All other operations in can be 'vectorized' to be efficient, but chisq.test() cannot be (I bet there's a function that does do this in a vectorized way?). It is therefore a candidate for parallel evaluation, but I would only do this at the end, when it becomes apparent that this is a rate-limiting step.

library(BiocParallel)
comb.pval <- bpmapply(function(a, b, c) {
    chisq.test(c(a, b, c))$p.value
}, a$score, max(extractList(b$score, ab_hitlist)), max(extractList(c$score, ac_hitlist)))

Justification for updated step 3.

which.max(extractList(b$score, ab_hitlist))

returns a plain vector of indexes in to each of the list elements of the hitlist. Coercing this to a List

idx0 <- as(which.max(extractList(gr$score, hitlist)), "List")

restores the 'geometry' of the vector, so that we have a list of length(ab_hitline). We then create an index into the grang that we're interested in, cast it in the appropriate geometry, and subset with the index

idx1 <- unlist(extractList(seq_along(b), ab_hitlist)[idx0])

These are the indexes into the original ranges, so we can retrieve them with

b[idx1]

A little complicated, so we implement this as a small helper function

keepone <- function(gr, hitlist) {
    idx0 <- as(which.max(extractList(gr$score, hitlist)), "List")
    idx1 <- unlist(extractList(seq_along(gr), hitlist)[idx0])
    gr[idx1]
}
ADD COMMENT
1
Entering edit mode

Can you REPLY to this comment with a clear description of how you want to drop regions? In an early comment you said "I assume that some features (a.k.a, enriched regions) may possibly discarded in single ChIP-seq trial, so I want to save that discarded regions (these discarded regions may possibly play significant roles on TFBS and gene regulations ) that discarded in single trial by using combined evidence of multiple ChIP- seq experiment". To me it sounds like, after you calculate the p.values, you want to set a threshold, and simply remove the ranges that do not reach that threshold

keep <- comb.pval < 0.01
a[keep]
b_one[keep]
c_one[keep]
ADD REPLY
0
Entering edit mode

we come up method  before applying chisq.test (we set user defined parameter minNumOlp_Required).
let's say, if we have three sample (a,b,c), and I set param as minNumOlp_Required =2L(minimum number of required overlap, default value as 1L),then, number of overlapped regions (query regions a[1] is also included) as: 

if c_one[1] <- NA

numOvlp -> sum (length(a[1]), length(b_one[1])) =2L

OR we have exactly 3 overlap regions (a[1] is also included)

numOvlp -> sum (length(a[1]), length(b_one[1]), length(c_one[1])) =3L

both cases  numOvlp >= minNumOlp_Required is TRUE, we can apply chisq.test for score of a[1], b_one[1], c_one[1] and continue  compare comb.pval with param gamma; otherwise, if not true, a[1], b_one[1], c_one[1] are discarded.

if I set param as minNumOlp_Required=3L, where assume c_one[1] <- NA
then numOvlp < minNumOlp_Required, and no need to apply chisq.test, a[1], b_one[1], c_one[1] are discarded.

 

ADD REPLY
1
Entering edit mode

The function lengths(ab_hitlist) is a more efficient version of sapply(ab_hitlist, length). It seems like you are testing the length of each hitlist

test <- (lengths(ab_hitlist) > 0) + (lengths(ac_hitlist) > 0)

and then keeping the relevant ranges from a and hitlist

keep <- test >= minNumOlp_Required
a = a[keep]
ab_hitlist = ab_hitlist[keep]
ac_hitlist = ac_hitlist[keep]

Is that correct?

ADD REPLY
0
Entering edit mode

Yes, that's right. If I found sufficient overlapped regions (for example two regions overlapped from b, c), then test is passed (user defined param is crucial) and I can apply chisq.test, otherwise test is failed (no sufficient overlapped regions were found), all are discarded. If test is succeed, I can go further analyzing step after all. 

I got interesting point about lengths from S4Vectos , I used to use length from base package. However, result of test are also List now. 

ADD REPLY
1
Entering edit mode

I have summarized the work flow we have discussed. There is still some work to do, and it is likely that there are more efficient / elegant ways to do what we have done, but I hope this is a useful start. Best of luck!

ADD REPLY
0
Entering edit mode

Dear Martin Morgan:

I am grateful for your answer. but regarding overlapped regions across multiple GRanges objects, I must preserve its metadata column (keeping score column are crucial to go next analyzing phase). I am convinced with you about inefficiency of taking only one ranges out of multiple ranges in first GRanges objects. Here is one of most important things I want to implement on my packages: assume that dataset were collected from ChIP-Seq experiment for mouse tumor cell, we did conduct 3 time ChIP-seq trail for same cell, and in each individual experiment, some existing features from first trial may possibly discarded and some new features may possibly added in next trail. So, Why I am so insist to take only one ranges out of multiple ? Because I assume that some features (a.k.a, enriched regions) may possibly discarded in single ChIP-seq trail, so I want to save that discarded regions (these discarded regions may possibly play significant roles on TFBS and gene regulations ) that discarded in single trail by using combined evidence of multiple ChIP- seq experiment ( to find out overlapped regions across other GRanges objects simultaneously in parallel). Plus, I have updated my original post

ADD REPLY
1
Entering edit mode

Good, I'm glad that you understand the efficiency. I have updated the answer to summarize this step, and also to summarize step 4. Please click the 'ADD COMMENT' button (i.e., start a new comment thread) if you are following step 4.

Yes, I know that you are interested in the metadata; this will be the next part of the solution.

I understand that you want to somehow process each range in a. My strategy will be to calculate for all a, and then to add or remove results as needed. I think you want to calculate a[1], then maybe discard it, etc. I will eventually add to my answer to show how to add / remove ranges from a.

ADD REPLY
0
Entering edit mode

Dear Martin Morgan:

I sketched out your solution in R function and it works very fast, and I am surely follow your strategy and expect to see your continuous approach.  However, I have this doubt before you present further step, for example:

Regarding ab_hitlist[[1]] ; ac_hitlist[[1]] : is that possible to expand them all in step 2 (certainly there were multiple intersected regions)? may be this is not necessarily explore them all here. Yes, I want to see your next solution . 

Great, after we got comb.pvalue by using chisq.test and to compare it with threshold param gamma (default value as 1e-10), if comb.pval is greater than threshold param, then a[1], ab_hitlist[[1]] ; ac_hitlist[[1]] (only keep one from each that has siginificant score value) are saved, otherwise all are discarded. In this case, we need to expand ab_hitlist[[1]] ,ac_hitlist[[1]] and extract only one from each (the one with highest score value). I am wondering how this can be done without expand them?

I am quite impressed with efficiency of using as(f, "List"), and appreciated much to you giving me this efficient solution. I am expecting to see your continuous approach. Many thanks to your great effort.

Sincerely:

Jurat

ADD REPLY
1
Entering edit mode

When you say "Regarding ab_hitlist[[1]] ; ac_hitlist[[1]] : is that possible to expand them all in step 2" I think you mean that you want to see the b ranges that a overlaps,

GRanges object with 4 ranges and 2 metadata columns:
      seqnames    ranges strand |   rangeName     score
         <Rle> <IRanges>  <Rle> | <character> <integer>
  [1]     chr1   [1, 11]      * |           a        13
  [2]     chr1   [2, 12]      * |           b        16
  [3]     chr1   [3, 13]      * |           c        18
  [4]     chr1   [4, 14]      * |           d         2
  -------
  seqinfo: 4 sequences from an unspecified genome; no seqlengths

Is that right?

ADD REPLY
0
Entering edit mode

Yes, that's right.

ADD REPLY
1
Entering edit mode

I'm not sure that you need these ranges (we can identify the range with maximum score without creating the subset of overlapping ranges themselves), but they are available as extractList(b, ab_hitlist)[[1]].

ADD REPLY
0
Entering edit mode

now, I realize that before applying chisq.test , I must expand ab_hitlist[[1]] ; ac_hitlist[[1]], and extract only one from each (the one with highest score value).

for example: ab_hitlist is just transitional result: 

ab_hitlist
IntegerList of length 8
[[1]] 1 2 3 4
[[2]] 1 2 3 4
[[3]] 1 2 3 4
[[4]] 5
[[5]] 5
[[6]] 6 7
[[7]] 8
[[8]] 8

then, we expand ab_hitlist and extract only one ranges from each row (if multiple intersection found, otherwise kept single hit as it is), and desired ab_hitlist will be like this:

ab_hitlist
IntegerList of length 8
[[1]] 3 
[[2]] 3 
[[3]] 3 
[[4]] 5
[[5]] 5
[[6]] 6
[[7]] 8
[[8]] 8

then, we can do chisq.test upon score column to get comb.pvalue

 

ADD REPLY
1
Entering edit mode

The maximum score for each element of a is

> max(extractList(b$score, ab_hitlist))
[1] 18 18 18  9  9 12  6  6

Is that what you want for the chisq.test in step 4?

ADD REPLY
0
Entering edit mode

Yes, that's exactly what I expected to apply chisq.test in step 4. 

ADD REPLY
0
Entering edit mode

Regarding  max(extractList(b$score, ab_hitlist)), how can I extract corresponding overlapped regions from b (I assumed that first track back its idex list from ab_hitlist, then extract corresponding regions)? I gave it try for how to retrieve back b whose score match with  max(extractList(b$score, ab_hitlist)), but I didn't get right one Plus, I followed your strategy and I would like to see your further step for how to add/ remove ranges from b.

ADD REPLY
1
Entering edit mode

I added step 3, which lead to a revision of step 4. Is step 3 providing you what you want for the matching ranges in b and c?

ADD REPLY
0
Entering edit mode

Yes, that's exactly matched what I expected, in particular, implementation of helper function does pretty good job here. 

ADD REPLY
0
Entering edit mode

Dear Martin Morgan:

Regarding missing values from ab_hitlist (consider the case that there are bunch of non overlapped regions found), how can I fix this issue ? I think the error comes from this line of code :

idx1 <- unlist(extractList(seq_along(gr), hitlist)[idx0])

I intentionally changed the sample a,b,c that in order to let them not overlapped entirely (I mean there is possible case that several missing values are occured.)

However, I got this error message when I have multiple missing values from ab_hitlist

Error in NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) : 
  subscript contains NAs or out-of-bounds indices

Is there any alternative way to avoid this bug ? Thank you

Sincerely:

Juray

ADD REPLY
1
Entering edit mode

One possibility is to remove the ranges from a, ab_hitlist, ac_hitlist with length 0 as a 'pre-extract-list' step. Maybe there is a better overall solution... Probably it would help to have a simple worked example so that we are sure about what the problem is.

ADD REPLY
0
Entering edit mode

Yes, I updated original post with more robust reproducible example for a,b,c (the case for addressing NA values from two or more hitlist). However, I used ab_hitlist <- Filter(length, ab_hitlist), and it works well now, error is removed. I figured out that. Thank you

 

 

ADD REPLY
2
Entering edit mode
@michael-lawrence-3846
Last seen 3.0 years ago
United States

I think you want to look into the BiocParallel package. There is also GenomicFiles, which has specific functionality for processing files in parallel.

It would look something like this:

hitsList <- bplapply(grs[-1L], function(gr) findOverlaps(grs[[1L]], gr))

If you need to do things pairwise, look into expand.grid() and bpmapply(). Once you compute the hits in parallel, and you have a list of Hits, do this to intersect them:

hits <- Reduce(intersect, hitsList)

 

ADD COMMENT
0
Entering edit mode

Hi,

I couldn't apply your solution in my case. Could you give me reproducible example to test your code and try to understand how it works. Thanks a lot

ADD REPLY
1
Entering edit mode

It's pretty hard for me to know what you mean; maybe you could EDIT your original question with a simple example, based on GRanges(). For instance, I did

library(GenomicRanges)

a <- GRanges("A", IRanges(c(10, 20, 30), width=5))
b <- GRanges("A", IRanges(c(21, 31, 41), width=5))
c <- GRanges("A", IRanges(c(32, 42, 52), width=5))

Maybe you're interested in the intersection of the ranges of a, b, and c. I found the intersection between the overlap of b and a

a <- findOverlaps(b, a)
ba_gr <- pintersect(b[queryHits(ba)], a[subjectHits(ba)])

and then the intersection between the overlap of c and these ranges

cba <- findOverlaps(c, ba_gr)
cba_gr <- pintersect(c[queryHits(cba)], ba_gr[subjectHits(cba)])

The end result is

> cba_gr
GRanges object with 1 range and 1 metadata column:
      seqnames    ranges strand |       hit
         <Rle> <IRanges>  <Rle> | <logical>
  [1]        A  [32, 34]      * |      TRUE
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

What would you like your result to be, for these example GRanges()?

ADD REPLY
0
Entering edit mode

Dear Martin Morgan:

I am very thankful to your effort on my question so far (answer from stackoverflow), truly appreciated. Yes, I have updated my question, redefine my question entirely, also I come up reproducible example and desired results. finding overlap, using findOverlaps is too generic for my case. I have tested dummy code of bplapply and I realized it might solve my problem. I will be grateful if you give me any idea how can I get out this headache. Thanks a lot. 

ADD REPLY
2
Entering edit mode
Ou, Jianhong ★ 1.3k
@ou-jianhong-4539
Last seen 12 days ago
United States

Did you tried ChIPpeakAnno::findOverlapsOfPeaks like this?

library(ChIPpeakAnno)
ol <- findOverlapsOfPeaks(a, b, c, maxgap=0)
olpeaks <- ol$peaklist[["a///b///c"]]
all <- GRangesList(a, b, c)
all <- mapply(function(.ele, .name){names(.ele)<- paste(.name, 1:length(.ele), sep="__"); .ele}, all, c("a", "b", "c"))
all <- unlist(GRangesList(all))
pval <- sapply(olpeaks$peakNames, function(.ele) chisq.test(all[.ele]$score)$p.value) ##here should be changed to keep the best one for each peak list
...
​
ADD COMMENT
2
Entering edit mode
@dimitris-polychronopoulos-9192
Last seen 7.4 years ago
United Kingdom

You could also have a look and try regioneR and LOLA bioconductor packages.

0
Entering edit mode

Thank you. I will look at it. 

ADD REPLY
0
Entering edit mode

Hi,

Thanks for your help, apparently LOLA package gives me a lot of idea to address my problem. Thanks again for your favor

ADD REPLY

Login before adding your answer.

Traffic: 459 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6