Question

Performing findOverlaps when both subject and query are on the order of a million

0

Entering edit mode

mrk.carty ▴ 30

@mrkcarty-7442

Last seen 10.3 years ago

United States

Hi everyone,

I want to do a findOverlaps when both subject and query are very large. My query is a GRanges object of genomic intervals on a particular chromosome (~8 million ranges), and my subject is another GRanges object of roughly 2 million reads. I thought of dividing the queries into smaller chunks (~76K), and then performing the overlaps on these smaller chunk as multiple jobs on a Torque job scheduler. I requested 4 processor for one node, 40 gb of real and virtual memory, and 10 gb of memory for both physical memory and virtual memory for each processor.

> reads
GRangesList object of length 2:
$R1
GRanges object with 1948817 ranges and 0 metadata columns:
            seqnames               ranges strand
               <Rle>            <IRanges> <Rle>
        [1]    chr21 [34574583, 34574683]      -
        [2]    chr21 [29410526, 29410626]      +
        [3]    chr21 [16007419, 16007519]      -
        [4]    chr21 [18682677, 18682777]      -
        [5]    chr21 [16577271, 16577323]      -
        ...      ...                  ...    ...
[1948813]    chr21 [37892775, 37892875]      +
[1948814]    chr21 [44730179, 44730279]      +
[1948815]    chr21 [21558129, 21558229]      +
[1948816]    chr21 [40091321, 40091421]      +
[1948817]    chr21 [23817243, 23817339]      +

...
<1 more element>
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths

> pairs
GRangesList object of length 2:
$R1
GRanges object with 7651444 ranges and 0 metadata columns:
            seqnames               ranges strand
               <Rle>            <IRanges> <Rle>
        [1]    chr21   [9410946, 9417138]      *
        [2]    chr21   [9410946, 9417138]      *
        [3]    chr21   [9410946, 9417138]      *
        [4]    chr21   [9410946, 9417138]      *
        [5]    chr21   [9410946, 9417138]      *
        ...      ...                  ...    ...
[7651440]    chr21 [48113915, 48117945]      *
[7651441]    chr21 [48113915, 48117945]      *
[7651442]    chr21 [48113915, 48117945]      *
[7651443]    chr21 [48113915, 48117945]      *
[7651444]    chr21 [48113915, 48117945]      *

...
<1 more element>
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths

cores = 4;

This is my code:

library(GenomicRanges)
chunk <- function(x,n) split(x,sort(rank(seq_len(length(x))%%n)))

overlaps <- function(G1,G2,forward,reverse,cores){
     library(parallel)
     hitsL <- findOverlaps(G1,forward)
     hitsR <- findOverlaps(G2,reverse)
     L <- split(subjectHits(hitsL),queryHits(hitsL))
     R <- split(subjectHits(hitsR),queryHits(hitsR))
     res <- unlist(mclapply(1:length(G1),function(x) length(intersect(L[[x]],R[[x]])),mc.cores=cores))
     return (res)
}

n <- length(pairs$R1)
forward <- c(reads$R1,reads$R2)
reverse <- c(reads$R2,reads$R1)

G1 <- chunk(pairs$R1,100)
G2 <- chunk(pairs$R2,100)

counts <- NULL

time.proc <- system.time({for(i in 1:3) counts <- c(counts,overlaps(G1[[i]],G2[[i]],forward,reverse,cores)) })[3]/60

time.proc
elapsed
4.4179

It would take about 2 hrs to complete the job. How can I achieve a speedup?

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] C

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base

other attached packages:
[1] doMC_1.3.3           iterators_1.0.7      foreach_1.4.2
[4] GenomicRanges_1.18.1 GenomeInfoDb_1.2.0   IRanges_2.0.0
[7] S4Vectors_0.4.0      BiocGenerics_0.12.0

loaded via a namespace (and not attached):
[1] XVector_0.6.0 codetools_0.2-9 tools_3.1.1

Best,

Mark

findoverlaps • 1.7k views

ADD COMMENT • link updated 10.9 years ago by Hervé Pagès 16k • written 10.9 years ago by mrk.carty ▴ 30

score 1 · Answer 1 · 2015-03-12

Hi Mark,

findOverlaps()/countOverlaps() and family were re-implemented to use the Nested Containment List algo in BioC devel (BioC 3.1, requires R 3.2). See this announcement for the details:

https://stat.ethz.ch/pipermail/bioc-devel/2014-December/006749.html

With a query and subject on the order of a million, I would expect the new algorithm to be much faster than the old one, maybe 5x or 10x, and even more if there is a lot (e.g. 50 millions) of overlaps to return in the Hits object. Memory usage should also be reduced significantly (roughly by half).

Cheers,

H.