Performing findOverlaps when both subject and query are on the order of a million
1
0
Entering edit mode
mrk.carty ▴ 30
@mrkcarty-7442
Last seen 9.2 years ago
United States

Hi everyone,

I want to do a findOverlaps when both subject and query are very large. My query is a GRanges object of genomic intervals on a particular chromosome (~8 million ranges), and my subject is another GRanges object of roughly 2 million reads. I thought of dividing the queries into smaller chunks (~76K), and then performing the overlaps on these smaller chunk as multiple jobs on a Torque job scheduler. I requested 4 processor for one node, 40 gb of real and virtual memory, and 10 gb of memory for both physical memory and virtual memory for each processor.

> reads
GRangesList object of length 2:
$R1
GRanges object with 1948817 ranges and 0 metadata columns:
            seqnames               ranges strand
               <Rle>            <IRanges>  <Rle>
        [1]    chr21 [34574583, 34574683]      -
        [2]    chr21 [29410526, 29410626]      +
        [3]    chr21 [16007419, 16007519]      -
        [4]    chr21 [18682677, 18682777]      -
        [5]    chr21 [16577271, 16577323]      -
        ...      ...                  ...    ...
  [1948813]    chr21 [37892775, 37892875]      +
  [1948814]    chr21 [44730179, 44730279]      +
  [1948815]    chr21 [21558129, 21558229]      +
  [1948816]    chr21 [40091321, 40091421]      +
  [1948817]    chr21 [23817243, 23817339]      +

...
<1 more element>
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths

> pairs   
GRangesList object of length 2:
$R1
GRanges object with 7651444 ranges and 0 metadata columns:
            seqnames               ranges strand
               <Rle>            <IRanges>  <Rle>
        [1]    chr21   [9410946, 9417138]      *
        [2]    chr21   [9410946, 9417138]      *
        [3]    chr21   [9410946, 9417138]      *
        [4]    chr21   [9410946, 9417138]      *
        [5]    chr21   [9410946, 9417138]      *
        ...      ...                  ...    ...
  [7651440]    chr21 [48113915, 48117945]      *
  [7651441]    chr21 [48113915, 48117945]      *
  [7651442]    chr21 [48113915, 48117945]      *
  [7651443]    chr21 [48113915, 48117945]      *
  [7651444]    chr21 [48113915, 48117945]      *

...
<1 more element>
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths

cores = 4;

This is my code:

 library(GenomicRanges)
  chunk <- function(x,n) split(x,sort(rank(seq_len(length(x))%%n)))

  overlaps <- function(G1,G2,forward,reverse,cores){
     library(parallel)
     hitsL <- findOverlaps(G1,forward)
     hitsR <- findOverlaps(G2,reverse)
     L <- split(subjectHits(hitsL),queryHits(hitsL))
     R <- split(subjectHits(hitsR),queryHits(hitsR))
     res <- unlist(mclapply(1:length(G1),function(x) length(intersect(L[[x]],R[[x]])),mc.cores=cores))
     return (res)
  }

  n <- length(pairs$R1)
  forward <- c(reads$R1,reads$R2)
  reverse <- c(reads$R2,reads$R1)
 
  G1 <- chunk(pairs$R1,100)
  G2 <- chunk(pairs$R2,100)

  counts <- NULL

  time.proc <- system.time({for(i in 1:3) counts <- c(counts,overlaps(G1[[i]],G2[[i]],forward,reverse,cores)) })[3]/60

 time.proc
elapsed
 4.4179

 It would take about 2 hrs to complete the job. How can I achieve a speedup?

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
[1] doMC_1.3.3           iterators_1.0.7      foreach_1.4.2       
[4] GenomicRanges_1.18.1 GenomeInfoDb_1.2.0   IRanges_2.0.0       
[7] S4Vectors_0.4.0      BiocGenerics_0.12.0

loaded via a namespace (and not attached):
[1] XVector_0.6.0   codetools_0.2-9 tools_3.1.1 

Best,

Mark

 

 

 

 

findoverlaps • 1.5k views
ADD COMMENT
1
Entering edit mode
@herve-pages-1542
Last seen 10 hours ago
Seattle, WA, United States

Hi Mark,

findOverlaps()/countOverlaps() and family were re-implemented to use the Nested Containment List algo in BioC devel (BioC 3.1, requires R 3.2).  See this announcement for the details:

https://stat.ethz.ch/pipermail/bioc-devel/2014-December/006749.html

With a query and subject on the order of a million, I would expect the new algorithm to be much faster than the old one, maybe 5x or 10x, and even more if there is a lot (e.g. 50 millions) of overlaps to return in the Hits object. Memory usage should also be reduced significantly (roughly by half).

Cheers,

H.

ADD COMMENT

Login before adding your answer.

Traffic: 817 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6