slow granges in GAlignmentPairs
1
0
Entering edit mode
Arne Muller ▴ 20
@arne-muller-8308
Last seen 7.9 years ago
United States

Hello,

I'm trying to extract the position and fragment length of paired end reads from bam files using readGAlignmentPairs and granges. There was already a posting (GenomicAlignments - Speeding up readGAlignmentPairs()) about the performance of readGAlignmentPairs, but another issue is that the granges method for GAlignmentPairs is much slower than the one from GAlignment.

I tried to find an alternative for the GAlignmentPairs granges method and found the code below to work between 4 and 7 times faster (the performance ratio depends on the size of the bam file):

ga is a larger GAlignmentPairs object.

mygranges = function(ga) {

     l = granges(last(ga))
     f = granges(first(ga))
     pos = which(strand(f) == '+')
     neg = which(strand(f) == '-')
     gr.pos = GRanges(seqnames=seqnames(f[pos]),
     ranges=IRanges(start=start(f[pos]), end=end(l[pos])), strand='+')

     gr.neg = GRanges(seqnames=seqnames(f[neg]),
     ranges=IRanges(start=start(l[neg]), end=end(f[neg])), strand='-')
     
     i = order(c(pos, neg))
     return(c(gr.pos, gr.neg)[i])
}

> system.time(gr <- mygranges(ga[1:1000000]))
   user  system elapsed 
  2.635   0.027   2.664 
> system.time(gr.orig <- granges(ga[1:1000000]))
   user  system elapsed 
 11.593   0.117  11.721 

> table(gr.orig == gr)

   TRUE 
1000000 

<font face="monospace">For my full data set this is:</font>

> system.time(gr <- mygranges(ga))

   user  system elapsed 

112.800  57.311 174.031 

> system.time(gr.orig <- granges(ga))

    user   system  elapsed 

1033.727  263.714 1320.971 

> table(gr.orig == gr)
    TRUE 
47483073 

It seems that the results are identical between granges and "mygranges" but the latter is much faster. However, I'm not sure that "mygranges" covers all of the original functionality (also, I'm still on 3.2, so I can't tell whether this was fixed in 3.3 or GenomicAlignments 1.8). The alternative implementation is still rather slow, but I can't think of anything simple that'd speed it up dramatically (at least not without digging into GAlignmentPairs).

If you think this granges implementation is correct (and covers all cases the original granges does) it might be worth considering it for GAlignmentPairs. 

     Arne

> sessionInfo()

R version 3.2.2 Patched (2015-09-03 r69280)

Platform: x86_64-pc-linux-gnu (64-bit)

Running under: Red Hat Enterprise Linux Server release 6.7 (Santiago)



locale:

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              

 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    

 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   

 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 

 [9] LC_ADDRESS=C               LC_TELEPHONE=C            

[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       



attached base packages:

[1] stats4    parallel  stats     graphics  grDevices utils     datasets 

[8] methods   base     



other attached packages:

 [1] GenomicAlignments_1.6.0    Rsamtools_1.22.0          

 [3] Biostrings_2.38.0          XVector_0.10.0            

 [5] SummarizedExperiment_1.0.0 Biobase_2.30.0            

 [7] GenomicRanges_1.22.0       GenomeInfoDb_1.6.0        

 [9] IRanges_2.4.0              S4Vectors_0.8.0           

[11] BiocGenerics_0.16.0        BatchJobs_1.6             

[13] BiocParallel_1.4.0         BiocInstaller_1.20.1      



loaded via a namespace (and not attached):

 [1] magrittr_1.5         zlibbioc_1.16.0      brew_1.0-6          

 [4] sendmailR_1.2-1      stringr_1.0.0        tools_3.2.2         

 [7] fail_1.2             checkmate_1.6.2      DBI_0.3.1           

[10] lambda.r_1.1.7       futile.logger_1.4.1  digest_0.6.8        

[13] bitops_1.0-6         base64enc_0.1-3      futile.options_1.0.0

[16] RSQLite_1.0.0        stringi_0.5-5        BBmisc_1.9    
GenomicAlignments • 825 views
ADD COMMENT
1
Entering edit mode
@herve-pages-1542
Last seen 19 hours ago
Seattle, WA, United States

Hi Arne,

Thanks for pointing this out and for proposing a more efficient implementation. I'll take a close look at it and make sure it doesn't introduce any regression. If everything looks OK I'll use it to improve the speed of the "granges" method for GAlignmentPairs objects.

Cheers,

H.

ADD COMMENT
0
Entering edit mode

This is implemented in GenomicAlignments 1.8.3 (release) and 1.9.4 (devel). Both versions will become available via biocLite() in the next 48 hours or so.

Thanks again for your feedback,

H.

ADD REPLY

Login before adding your answer.

Traffic: 755 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6