Question: How to implement element-wise-overlaps with iteration for two different bed files in R?
0
gravatar for Jurat Shahidin
3.8 years ago by
Chicago, IL, USA
Jurat Shahidin70 wrote:

Hi:

I want to implement element-wise-overlaps with iteration for two different bed files. Where I read two bed files as set of GRanges objects such as:

library(rtracklayer)
bed1 <- import("foo.bed")
bed2 <- import("bar.bed")

then callling findOverlaps():

overlap[1] <- findOverlaps(bed1[1], bed.2)

I have tried for loop for iterating each row of bed.1 versus bed.2 and discovering overlapped interval. But, code was not efficient. Anyone can provide simple R script for iterating each peak interval from bed.1 versus bed.2 to detect overlapped regions?? Thanks a lot

 

R findoverlaps iterators • 844 views
ADD COMMENTlink modified 3.8 years ago by Hervé Pagès ♦♦ 14k • written 3.8 years ago by Jurat Shahidin70
Answer: How to implement element-wise-overlaps with iteration for two different bed file
5
gravatar for Hervé Pagès
3.8 years ago by
Hervé Pagès ♦♦ 14k
United States
Hervé Pagès ♦♦ 14k wrote:

Hi,

element-wise-overlaps would be to do

findOverlaps(bed1[i], bed2[i])

for all i. That implies that bed1 and bed2 have the same length. That doesn't look like what you are trying to do.

To find the overlaps of each element in bed1 with all elements in bed2, you could do

findOverlaps(bed1[i], bed2)

for all i in 1:length(bed1). However, as you found out, doing this in a loop is inefficient. It is much more efficient to do:

as(findOverlaps(bed1, bed2), "List")

This will return an IntegerList object parallel to bed1 (i.e. of the same length as bed1). This IntegerList object can be manipulated like an ordinary list of integer vectors.

Cheers,

H.

ADD COMMENTlink written 3.8 years ago by Hervé Pagès ♦♦ 14k

@ Hervé Pagès : Thanks a lot for your answer.

ADD REPLYlink written 3.8 years ago by Jurat Shahidin70

@Hervé Pagès: sorry for this repeated question.  I am trying to evaluate every genomic interval from bed.1 versus all element in bed.2 and I am gonna evaluate combined p-value of overlapped genomic regions by Fisher' exact test. based on your answer, I have coded like this:

f1<- function(x, y){
  x <- GRanges(x)
  y <- GRanges(y)
  for(i in 1:length(x)){
     queryBed <- x[i]
     overlap <- findOverlaps(queryBed, y, minoverlap = 10, algorithm = "intervaltree")
  }
  return(overlap)
}

 

but, certainly it is not good idea, R-studio still executing the code and haven't returned overlapped interval. Is there any alternative solution for that ? Do you know any package that processing multiple bed files in parallel ? many thanks

ADD REPLYlink written 3.8 years ago by Jurat Shahidin70
1

First of all the f1 function doesn't make much sense: the value of overlap will be overwritten at each iteration so the function will only return the result of the last call to findOverlaps().

Other than that: yes we both know that for(i in 1:length(x)) findOverlaps(x[i], y) is very inefficient and I gave you a replacement for it. Did you try it? Did you run into some problems while trying it?

About your last question: "Do you know any package that processing multiple bed files in parallel?"

Just to clarify: the inefficiency you're seeing only involves 2 bed files and is easy to work around. I'm not sure parallelizing would help much here but you could try (with either parallel::mclapply() or BiocParallel::bplapply()). I'm pretty confident it won't be as fast as as(findOverlaps(x, y), "List") though. In any case, I wouldn't call this kind of parallelization "processing multiple bed files in parallel", that's misleading. The files are not processed in parallel. Anyway and FWIW the closest thing to a "package for processing multiple bed files in parallel" is GenomicFiles. Be aware that it's low-level and pretty generic: you have to learn how to use it and will need to write the code that implements the particular processing you want to apply to each file.

One last thing: in the current version of BioC (3.2), the algorithm argument is defunct. I highly recommend that you update your installation to use the current BioC. Oldest versions are not supported.

Cheers,

H.

ADD REPLYlink written 3.8 years ago by Hervé Pagès ♦♦ 14k

@ Hervé Pagès: really appreciated your kind favor so far !!

 

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by Jurat Shahidin70

Hi,

I tried your solution it works. But, when I do like this:

x <- as(import.bed(bed.1), "GRanges")
y <- as(import.bed(bed.2), "GRanges")

res <- as(findOverlaps(x,y), "list")
qh <- x[queryHits(res)]       # gave me error
sh <- y[subjectHits(res)]     # gave me error

 

instead, I tried like this: (I am still insist on row wise operation for GRanges objects)

x <- import.bed(bed.1)
y <- as(import.bed(bed.2), "GRanges")

o <- findOverlaps(query= sapply(x, function(i) { as(i, "GRanges")}), subject=y, minOverlaps=10)

res <- data.frame(query[queryHits(o)], subject[subjectHits(o)])

 

Could you evaluate the feasibility of my code? Thanks a lot

ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by Jurat Shahidin70

Hi,

I was suggesting that you coerce to List, not to list, i.e. that you do res <- as(findOverlaps(x,y), "List"), not res <- as(findOverlaps(x,y), "list"). The former is more efficient. Anyway the result is not a Hits object so it doesn't make sense to call queryHits() or subjectHits() on it.

I'm not sure what your code is trying to do. If all you want to do is store the hits in a data.frame, then just call as.data.frame() on the Hits object returned by findOverlaps():

as.data.frame(findOverlaps(x, y), minOverlaps=10)

You don't need any loop for that.

H.

ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by Hervé Pagès ♦♦ 14k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 209 users visited in the last hour