Question

How to implement element-wise-overlaps with iteration for two different bed files in R?

0

Entering edit mode

Jurat Shahidin ▴ 80

@jurat-shahidin-9488

Last seen 4.7 years ago

Chicago, IL, USA

Hi:

I want to implement element-wise-overlaps with iteration for two different bed files. Where I read two bed files as set of GRanges objects such as:

library(rtracklayer)
bed1 <- import("foo.bed")
bed2 <- import("bar.bed")

then callling findOverlaps():

overlap[1] <- findOverlaps(bed1[1], bed.2)

I have tried for loop for iterating each row of bed.1 versus bed.2 and discovering overlapped interval. But, code was not efficient. Anyone can provide simple R script for iterating each peak interval from bed.1 versus bed.2 to detect overlapped regions?? Thanks a lot

r findoverlaps iterators • 2.0k views

ADD COMMENT • link updated 8.9 years ago by Hervé Pagès 16k • written 8.9 years ago by Jurat Shahidin ▴ 80

score 5 · Accepted Answer · 2016-01-14

5

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 4 days ago

Seattle, WA, United States

Hi,

element-wise-overlaps would be to do

findOverlaps(bed1[i], bed2[i])

for all i. That implies that bed1 and bed2 have the same length. That doesn't look like what you are trying to do.

To find the overlaps of each element in bed1 with all elements in bed2, you could do

findOverlaps(bed1[i], bed2)

for all i in 1:length(bed1). However, as you found out, doing this in a loop is inefficient. It is much more efficient to do:

as(findOverlaps(bed1, bed2), "List")

This will return an IntegerList object parallel to bed1 (i.e. of the same length as bed1). This IntegerList object can be manipulated like an ordinary list of integer vectors.

Cheers,

H.

ADD COMMENT • link 8.9 years ago Hervé Pagès 16k

0

Entering edit mode

@ Hervé Pagès : Thanks a lot for your answer.

ADD REPLY • link 8.9 years ago Jurat Shahidin ▴ 80

0

Entering edit mode

@Hervé Pagès: sorry for this repeated question. I am trying to evaluate every genomic interval from bed.1 versus all element in bed.2 and I am gonna evaluate combined p-value of overlapped genomic regions by Fisher' exact test. based on your answer, I have coded like this:

f1<- function(x, y){
  x <- GRanges(x)
  y <- GRanges(y)
  for(i in 1:length(x)){
     queryBed <- x[i]
     overlap <- findOverlaps(queryBed, y, minoverlap = 10, algorithm = "intervaltree")
  }
  return(overlap)
}

but, certainly it is not good idea, R-studio still executing the code and haven't returned overlapped interval. Is there any alternative solution for that ? Do you know any package that processing multiple bed files in parallel ? many thanks

ADD REPLY • link 8.9 years ago Jurat Shahidin ▴ 80

1

Entering edit mode

First of all the f1 function doesn't make much sense: the value of overlap will be overwritten at each iteration so the function will only return the result of the last call to findOverlaps().

Other than that: yes we both know that for(i in 1:length(x)) findOverlaps(x[i], y) is very inefficient and I gave you a replacement for it. Did you try it? Did you run into some problems while trying it?

About your last question: "Do you know any package that processing multiple bed files in parallel?"

Just to clarify: the inefficiency you're seeing only involves 2 bed files and is easy to work around. I'm not sure parallelizing would help much here but you could try (with either parallel::mclapply() or BiocParallel::bplapply()). I'm pretty confident it won't be as fast as as(findOverlaps(x, y), "List") though. In any case, I wouldn't call this kind of parallelization "processing multiple bed files in parallel", that's misleading. The files are not processed in parallel. Anyway and FWIW the closest thing to a "package for processing multiple bed files in parallel" is GenomicFiles. Be aware that it's low-level and pretty generic: you have to learn how to use it and will need to write the code that implements the particular processing you want to apply to each file.

One last thing: in the current version of BioC (3.2), the algorithm argument is defunct. I highly recommend that you update your installation to use the current BioC. Oldest versions are not supported.

Cheers,

H.

ADD REPLY • link 8.9 years ago Hervé Pagès 16k

0

Entering edit mode

@ Hervé Pagès: really appreciated your kind favor so far !!

ADD REPLY • link 8.9 years ago Jurat Shahidin ▴ 80

0

Entering edit mode

Hi,

I tried your solution it works. But, when I do like this:

x <- as(import.bed(bed.1), "GRanges")
y <- as(import.bed(bed.2), "GRanges")

res <- as(findOverlaps(x,y), "list")
qh <- x[queryHits(res)]       # gave me error
sh <- y[subjectHits(res)]     # gave me error

instead, I tried like this: (I am still insist on row wise operation for GRanges objects)

x <- import.bed(bed.1)
y <- as(import.bed(bed.2), "GRanges")

o <- findOverlaps(query= sapply(x, function(i) { as(i, "GRanges")}), subject=y, minOverlaps=10)

res <- data.frame(query[queryHits(o)], subject[subjectHits(o)])

Could you evaluate the feasibility of my code? Thanks a lot

ADD REPLY • link 8.9 years ago Jurat Shahidin ▴ 80

0

Entering edit mode

Hi,

I was suggesting that you coerce to List, not to list, i.e. that you do res <- as(findOverlaps(x,y), "List"), not res <- as(findOverlaps(x,y), "list"). The former is more efficient. Anyway the result is not a Hits object so it doesn't make sense to call queryHits() or subjectHits() on it.

I'm not sure what your code is trying to do. If all you want to do is store the hits in a data.frame, then just call as.data.frame() on the Hits object returned by findOverlaps():

as.data.frame(findOverlaps(x, y), minOverlaps=10)

You don't need any loop for that.

H.

ADD REPLY • link 8.9 years ago Hervé Pagès 16k