Question: Performing findOverlaps when both subject and query are on the order of a million
0
gravatar for mrk.carty
4.8 years ago by
mrk.carty30
United States
mrk.carty30 wrote:

Hi everyone,

I want to do a findOverlaps when both subject and query are very large. My query is a GRanges object of genomic intervals on a particular chromosome (~8 million ranges), and my subject is another GRanges object of roughly 2 million reads. I thought of dividing the queries into smaller chunks (~76K), and then performing the overlaps on these smaller chunk as multiple jobs on a Torque job scheduler. I requested 4 processor for one node, 40 gb of real and virtual memory, and 10 gb of memory for both physical memory and virtual memory for each processor.

> reads
GRangesList object of length 2:
$R1
GRanges object with 1948817 ranges and 0 metadata columns:
            seqnames               ranges strand
               <Rle>            <IRanges>  <Rle>
        [1]    chr21 [34574583, 34574683]      -
        [2]    chr21 [29410526, 29410626]      +
        [3]    chr21 [16007419, 16007519]      -
        [4]    chr21 [18682677, 18682777]      -
        [5]    chr21 [16577271, 16577323]      -
        ...      ...                  ...    ...
  [1948813]    chr21 [37892775, 37892875]      +
  [1948814]    chr21 [44730179, 44730279]      +
  [1948815]    chr21 [21558129, 21558229]      +
  [1948816]    chr21 [40091321, 40091421]      +
  [1948817]    chr21 [23817243, 23817339]      +

...
<1 more element>
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths

> pairs   
GRangesList object of length 2:
$R1
GRanges object with 7651444 ranges and 0 metadata columns:
            seqnames               ranges strand
               <Rle>            <IRanges>  <Rle>
        [1]    chr21   [9410946, 9417138]      *
        [2]    chr21   [9410946, 9417138]      *
        [3]    chr21   [9410946, 9417138]      *
        [4]    chr21   [9410946, 9417138]      *
        [5]    chr21   [9410946, 9417138]      *
        ...      ...                  ...    ...
  [7651440]    chr21 [48113915, 48117945]      *
  [7651441]    chr21 [48113915, 48117945]      *
  [7651442]    chr21 [48113915, 48117945]      *
  [7651443]    chr21 [48113915, 48117945]      *
  [7651444]    chr21 [48113915, 48117945]      *

...
<1 more element>
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths

cores = 4;

This is my code:

 library(GenomicRanges)
  chunk <- function(x,n) split(x,sort(rank(seq_len(length(x))%%n)))

  overlaps <- function(G1,G2,forward,reverse,cores){
     library(parallel)
     hitsL <- findOverlaps(G1,forward)
     hitsR <- findOverlaps(G2,reverse)
     L <- split(subjectHits(hitsL),queryHits(hitsL))
     R <- split(subjectHits(hitsR),queryHits(hitsR))
     res <- unlist(mclapply(1:length(G1),function(x) length(intersect(L[[x]],R[[x]])),mc.cores=cores))
     return (res)
  }

  n <- length(pairs$R1)
  forward <- c(reads$R1,reads$R2)
  reverse <- c(reads$R2,reads$R1)
 
  G1 <- chunk(pairs$R1,100)
  G2 <- chunk(pairs$R2,100)

  counts <- NULL

  time.proc <- system.time({for(i in 1:3) counts <- c(counts,overlaps(G1[[i]],G2[[i]],forward,reverse,cores)) })[3]/60

 time.proc
elapsed
 4.4179

 It would take about 2 hrs to complete the job. How can I achieve a speedup?

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
[1] doMC_1.3.3           iterators_1.0.7      foreach_1.4.2       
[4] GenomicRanges_1.18.1 GenomeInfoDb_1.2.0   IRanges_2.0.0       
[7] S4Vectors_0.4.0      BiocGenerics_0.12.0

loaded via a namespace (and not attached):
[1] XVector_0.6.0   codetools_0.2-9 tools_3.1.1 

Best,

Mark

 

 

 

 

findoverlaps • 653 views
ADD COMMENTlink modified 4.8 years ago by Hervé Pagès ♦♦ 14k • written 4.8 years ago by mrk.carty30
Answer: Performing findOverlaps when both subject and query are on the order of a millio
1
gravatar for Hervé Pagès
4.8 years ago by
Hervé Pagès ♦♦ 14k
United States
Hervé Pagès ♦♦ 14k wrote:

Hi Mark,

findOverlaps()/countOverlaps() and family were re-implemented to use the Nested Containment List algo in BioC devel (BioC 3.1, requires R 3.2).  See this announcement for the details:

https://stat.ethz.ch/pipermail/bioc-devel/2014-December/006749.html

With a query and subject on the order of a million, I would expect the new algorithm to be much faster than the old one, maybe 5x or 10x, and even more if there is a lot (e.g. 50 millions) of overlaps to return in the Hits object. Memory usage should also be reduced significantly (roughly by half).

Cheers,

H.

ADD COMMENTlink written 4.8 years ago by Hervé Pagès ♦♦ 14k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 150 users visited in the last hour