Hi:
I want to compute element-wise overlaps, by iteration, between two different BED files. I read the two BED files as GRanges objects, like this:
library(rtracklayer)
bed1 <- import("foo.bed")
bed2 <- import("bar.bed")
and then call findOverlaps():
overlap[1] <- findOverlaps(bed1[1], bed2)
I have tried a for loop that iterates over each row of bed1 and calls findOverlaps() against bed2 to discover the overlapping intervals, but the code was not efficient. Can anyone provide a simple R script for iterating over each peak interval of bed1 versus bed2 to detect the overlapping regions? Thanks a lot.

@Hervé Pagès: Thanks a lot for your answer.
@Hervé Pagès: Sorry for this repeated question. I am trying to evaluate every genomic interval from bed1 against every element of bed2, and then I want to compute a combined p-value for the overlapping genomic regions with Fisher's exact test. Based on your answer, I have coded this:
f1 <- function(x, y) {
    x <- GRanges(x)
    y <- GRanges(y)
    for (i in 1:length(x)) {
        queryBed <- x[i]
        overlap <- findOverlaps(queryBed, y, minoverlap = 10, algorithm = "intervaltree")
    }
    return(overlap)
}
But certainly it is not a good idea: RStudio is still executing the code and hasn't returned the overlapping intervals. Is there an alternative solution for that? Do you know any package that processes multiple bed files in parallel? Many thanks.
First of all, the f1 function doesn't make much sense: the value of overlap will be overwritten at each iteration, so the function will only return the result of the last call to findOverlaps().
Other than that: yes, we both know that
for (i in 1:length(x)) findOverlaps(x[i], y)
is very inefficient, and I gave you a replacement for it. Did you try it? Did you run into some problems while trying it?
About your last question: "Do you know any package that processes multiple bed files in parallel?"
Just to clarify: the inefficiency you're seeing only involves 2 bed files and is easy to work around. I'm not sure parallelizing would help much here, but you could try (with either parallel::mclapply() or BiocParallel::bplapply()). I'm pretty confident it won't be as fast as
as(findOverlaps(x, y), "List")
though. In any case, I wouldn't call this kind of parallelization "processing multiple bed files in parallel"; that's misleading. The files are not processed in parallel.
Anyway, and FWIW, the closest thing to a "package for processing multiple bed files in parallel" is GenomicFiles. Be aware that it's low-level and pretty generic: you have to learn how to use it and will need to write the code that implements the particular processing you want to apply to each file.
One last thing: in the current version of BioC (3.2), the algorithm argument of findOverlaps() is defunct. I highly recommend that you update your installation to use the current BioC. Older versions are not supported.
Cheers,
H.
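For reference, a minimal sketch of the vectorized approach described above; the file names and the minoverlap value are simply reused from the earlier posts:
library(rtracklayer)
library(GenomicRanges)
x <- import("foo.bed")    # query ranges
y <- import("bar.bed")    # subject ranges
## One vectorized call replaces the per-row loop:
hits <- findOverlaps(x, y, minoverlap = 10)
## Optional per-query view: element i lists the indices of the
## elements of 'y' that overlap x[i].
hits_by_query <- as(hits, "List")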
@Hervé Pagès: I really appreciate your help so far!
Hi,
I tried your solution and it works. But instead I tried it like this (I still insist on a row-wise operation for the GRanges objects):
x <- import.bed(bed.1)
y <- as(import.bed(bed.2), "GRanges")
o <- findOverlaps(query = sapply(x, function(i) { as(i, "GRanges") }),
                  subject = y, minOverlaps = 10)
res <- data.frame(query[queryHits(o)], subject[subjectHits(o)])
Could you evaluate the feasibility of my code? Thanks a lot.
Hi,
I was suggesting that you coerce to List, not to list, i.e. that you do
res <- as(findOverlaps(x, y), "List")
not
res <- as(findOverlaps(x, y), "list")
The former is more efficient. Anyway, the result is not a Hits object, so it doesn't make sense to call queryHits() or subjectHits() on it.
I'm not sure what your code is trying to do. If all you want to do is store the hits in a data.frame, then just call as.data.frame() on the Hits object returned by findOverlaps(). You don't need any loop for that.
H.
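To illustrate that last point, a minimal sketch (assuming x and y are the two GRanges objects imported from the BED files, as above):
## One vectorized call; 'hits' is a Hits object.
hits <- findOverlaps(x, y, minoverlap = 10)
## Coerce the Hits object straight to a data.frame of index pairs
## (columns 'queryHits' and 'subjectHits') -- no loop required.
hit_table <- as.data.frame(hits)
## If the overlapping ranges themselves are wanted, subset with the hit indices:
res <- data.frame(as.data.frame(x[queryHits(hits)]),
                  as.data.frame(y[subjectHits(hits)]))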