readGAlignmentPairs slow in problematic region
1
1
Entering edit mode
@mikelove
Last seen 9 hours ago
United States

I noticed that readGAlignmentPairs() is slow to read in some alignments for a gene when I provide the range of the gene as 'which' in the ScanBamParam. The gene is covered by some (likely) misaligned reads, which contain a gap that spans the entire range of the gene of interest. I've pre-filtered out an additional set of alignments which had quality score < 4, and tried various flags, but it's still slow to process the alignments as pairs. I have a subset BAM file (5Mb) which reproduces the problem which I will email to maintainer [at] bioc. The genome is hg19.

 

> system.time({ ga <- readGAlignments("subset.bam") })
   user  system elapsed 
  0.493   0.012   0.510 
> length(ga)
[1] 95502

> system.time({ ga <- readGAlignmentPairs("subset.bam") })
   user  system elapsed 
  1.870   0.014   1.887 
Warning message:
In .make_GAlignmentPairs_from_GAlignments(gal, strandMode = strandMode,  :
  6 pairs (0 proper, 6 not proper) were dropped because the seqname
  or strand of the alignments in the pair were not concordant.
  Note that a GAlignmentPairs object can only hold concordant pairs at the
  moment, that is, pairs where the 2 alignments are on the opposite strands
  of the same reference sequence.
Calls: system.time ... readGAlignmentPairs -> .make_GAlignmentPairs_from_GAlignments
> length(ga)
[1] 38579

> system.time({ ga <- readGAlignmentPairs("subset.bam", param=ScanBamParam(which=GRanges("chr9",IRanges(19228763,19376137)))) })
   user  system elapsed 
177.091   2.380 179.525 
Warning message:
In .make_GAlignmentPairs_from_GAlignments(gal, strandMode = strandMode,  :
  1 pairs (0 proper, 1 not proper) were dropped because the seqname
  or strand of the alignments in the pair were not concordant.
  Note that a GAlignmentPairs object can only hold concordant pairs at the
  moment, that is, pairs where the 2 alignments are on the opposite strands
  of the same reference sequence.
Calls: system.time ... readGAlignmentPairs -> .make_GAlignmentPairs_from_GAlignments

> system.time({ ga <- readGAlignmentPairs("subset.bam", param=ScanBamParam(which=GRanges("chr9",IRanges(19228763,19376137)), flag=scanBamFlag(isProperPair=TRUE, isSecondaryAlignment=FALSE))) })
   user  system elapsed 
148.353   2.260 150.659 
> length(ga)
[1] 3190

> sessionInfo()
R Under development (unstable) (2015-06-25 r68588)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: CentOS release 6.5 (Final)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices datasets  utils    
[8] methods   base     

other attached packages:
 [1] GenomicAlignments_1.5.12   Rsamtools_1.21.14         
 [3] Biostrings_2.37.2          XVector_0.9.1             
 [5] SummarizedExperiment_0.3.2 Biobase_2.29.1            
 [7] GenomicRanges_1.21.17      GenomeInfoDb_1.5.9        
 [9] IRanges_2.3.17             S4Vectors_0.7.12          
[11] BiocGenerics_0.15.5        BiocInstaller_1.19.9      

loaded via a namespace (and not attached):
[1] Rcpp_0.11.6          bitops_1.0-6         futile.options_1.0.0
[4] git2r_0.10.1         zlibbioc_1.15.0      futile.logger_1.4.1 
[7] lambda.r_1.1.7       BiocParallel_1.3.47  tools_3.3.0         
> 
genomicalignments • 1.4k views
ADD COMMENT
0
Entering edit mode

Hi Michael,

Yes, making the file available somewhere or sending it to maintainer AT bioconductor.org would be helpful. Thanks!

H.

ADD REPLY
3
Entering edit mode
@valerie-obenchain-4275
Last seen 2.2 years ago
United States

Hi Mike,

Sorry this has taken awhile. Martin and I have checked in a fix to Rsamtools 1.21.16.

As you discovered, there was quite a problem finding out-of-range mates in a deep coverage file. Previously mate searching was done at the level of each qname. The new code moves the searching up a level so we can look at (and order) mate position for all records at once. We also leverage knowledge of the bin structure in the index file by grouping mate positions by distance then search specific regions of the bam file that contain the groups.

The speed-up is a suprising 95+% increase in files with deep coverage. There isn't much difference in files with 'regular' (not deep) coverage (pasillaBamSubset example below).  This change affects the performance of mate pairing in a range only and has no effect on pairing a complete file.

Let me know how it goes.

Valerie

library(GenomicAlignments)

library(pasillaBamSubset)
 

## Rsamtools  <= 1.21.15

Deep coverage, range width 147375:

> fl <- "subset.bam"

> param = ScanBamParam(which=GRanges("chr9",IRanges(19228763,19376137)))
> system.time({gal <- readGAlignmentsList(fl, param=param) })
   user  system elapsed
 55.120   0.457  55.608
> length(gal)
[1] 4055
> table(mcols(gal)$mate_status)

    mated ambiguous   unmated
     3578         0       477


Regular coverage, range width 147375:

> fl = untreated3_chr4()
> param = ScanBamParam(which = GRanges("chr4", IRanges(194, 147568)))
> system.time({gal <- readGAlignmentsList(fl, param=param) })
   user  system elapsed
  0.251   0.001   0.252 


## Rsamtools 1.21.16

Deep coverage, range of width 147375:

> fl <- "subset.bam"

> param = ScanBamParam(which=GRanges("chr9",IRanges(19228763,19376137)))

> system.time({gal <- readGAlignmentsList(fl, param=param) })
   user  system elapsed
  0.286   0.000   0.286
> length(gal)
[1] 4055
> table(mcols(gal)$mate_status)

    mated ambiguous   unmated
     3578         0       477


Regular coverage, range of width 147375:

> fl = untreated3_chr4()
> param = ScanBamParam(which = GRanges("chr4", IRanges(194, 147568)))
> system.time({gal <- readGAlignmentsList(fl, param=param) })
   user  system elapsed
  0.233   0.002   0.234 

ADD COMMENT
0
Entering edit mode

Just edited my answer to make it more clear what file was being used ... since I named them both 'fl'.

Valerie

ADD REPLY
0
Entering edit mode

thanks Valerie and Martin! I will check it out and report back. this will really help speed up some scripts I'm working on.

ADD REPLY
0
Entering edit mode

I can confirm this fixes the slow down on my end. Thanks again, this is really helpful for me.

ADD REPLY

Login before adding your answer.

Traffic: 942 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6