Search
Question: readGAlignmentPairs slow in problematic region
1
gravatar for Michael Love
2.3 years ago by
Michael Love14k
United States
Michael Love14k wrote:

I noticed that readGAlignmentPairs() is slow to read in some alignments for a gene when I provide the range of the gene as 'which' in the ScanBamParam. The gene is covered by some (likely) misaligned reads, which contain a gap that spans the entire range of the gene of interest. I've pre-filtered out an additional set of alignments which had quality score < 4, and tried various flags, but it's still slow to process the alignments as pairs. I have a subset BAM file (5Mb) which reproduces the problem which I will email to maintainer [at] bioc. The genome is hg19.

 

> system.time({ ga <- readGAlignments("subset.bam") })
   user  system elapsed 
  0.493   0.012   0.510 
> length(ga)
[1] 95502

> system.time({ ga <- readGAlignmentPairs("subset.bam") })
   user  system elapsed 
  1.870   0.014   1.887 
Warning message:
In .make_GAlignmentPairs_from_GAlignments(gal, strandMode = strandMode,  :
  6 pairs (0 proper, 6 not proper) were dropped because the seqname
  or strand of the alignments in the pair were not concordant.
  Note that a GAlignmentPairs object can only hold concordant pairs at the
  moment, that is, pairs where the 2 alignments are on the opposite strands
  of the same reference sequence.
Calls: system.time ... readGAlignmentPairs -> .make_GAlignmentPairs_from_GAlignments
> length(ga)
[1] 38579

> system.time({ ga <- readGAlignmentPairs("subset.bam", param=ScanBamParam(which=GRanges("chr9",IRanges(19228763,19376137)))) })
   user  system elapsed 
177.091   2.380 179.525 
Warning message:
In .make_GAlignmentPairs_from_GAlignments(gal, strandMode = strandMode,  :
  1 pairs (0 proper, 1 not proper) were dropped because the seqname
  or strand of the alignments in the pair were not concordant.
  Note that a GAlignmentPairs object can only hold concordant pairs at the
  moment, that is, pairs where the 2 alignments are on the opposite strands
  of the same reference sequence.
Calls: system.time ... readGAlignmentPairs -> .make_GAlignmentPairs_from_GAlignments

> system.time({ ga <- readGAlignmentPairs("subset.bam", param=ScanBamParam(which=GRanges("chr9",IRanges(19228763,19376137)), flag=scanBamFlag(isProperPair=TRUE, isSecondaryAlignment=FALSE))) })
   user  system elapsed 
148.353   2.260 150.659 
> length(ga)
[1] 3190

> sessionInfo()
R Under development (unstable) (2015-06-25 r68588)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: CentOS release 6.5 (Final)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices datasets  utils    
[8] methods   base     

other attached packages:
 [1] GenomicAlignments_1.5.12   Rsamtools_1.21.14         
 [3] Biostrings_2.37.2          XVector_0.9.1             
 [5] SummarizedExperiment_0.3.2 Biobase_2.29.1            
 [7] GenomicRanges_1.21.17      GenomeInfoDb_1.5.9        
 [9] IRanges_2.3.17             S4Vectors_0.7.12          
[11] BiocGenerics_0.15.5        BiocInstaller_1.19.9      

loaded via a namespace (and not attached):
[1] Rcpp_0.11.6          bitops_1.0-6         futile.options_1.0.0
[4] git2r_0.10.1         zlibbioc_1.15.0      futile.logger_1.4.1 
[7] lambda.r_1.1.7       BiocParallel_1.3.47  tools_3.3.0         
> 
ADD COMMENTlink modified 2.2 years ago by Valerie Obenchain ♦♦ 6.4k • written 2.3 years ago by Michael Love14k

Hi Michael,

Yes, making the file available somewhere or sending it to maintainer AT bioconductor.org would be helpful. Thanks!

H.

ADD REPLYlink written 2.3 years ago by Hervé Pagès ♦♦ 13k
3
gravatar for Valerie Obenchain
2.2 years ago by
Valerie Obenchain ♦♦ 6.4k
United States
Valerie Obenchain ♦♦ 6.4k wrote:

Hi Mike,

Sorry this has taken awhile. Martin and I have checked in a fix to Rsamtools 1.21.16.

As you discovered, there was quite a problem finding out-of-range mates in a deep coverage file. Previously mate searching was done at the level of each qname. The new code moves the searching up a level so we can look at (and order) mate position for all records at once. We also leverage knowledge of the bin structure in the index file by grouping mate positions by distance then search specific regions of the bam file that contain the groups.

The speed-up is a suprising 95+% increase in files with deep coverage. There isn't much difference in files with 'regular' (not deep) coverage (pasillaBamSubset example below).  This change affects the performance of mate pairing in a range only and has no effect on pairing a complete file.

Let me know how it goes.

Valerie

library(GenomicAlignments)

library(pasillaBamSubset)
 

## Rsamtools  <= 1.21.15

Deep coverage, range width 147375:

> fl <- "subset.bam"

> param = ScanBamParam(which=GRanges("chr9",IRanges(19228763,19376137)))
> system.time({gal <- readGAlignmentsList(fl, param=param) })
   user  system elapsed
 55.120   0.457  55.608
> length(gal)
[1] 4055
> table(mcols(gal)$mate_status)

    mated ambiguous   unmated
     3578         0       477


Regular coverage, range width 147375:

> fl = untreated3_chr4()
> param = ScanBamParam(which = GRanges("chr4", IRanges(194, 147568)))
> system.time({gal <- readGAlignmentsList(fl, param=param) })
   user  system elapsed
  0.251   0.001   0.252 


## Rsamtools 1.21.16

Deep coverage, range of width 147375:

> fl <- "subset.bam"

> param = ScanBamParam(which=GRanges("chr9",IRanges(19228763,19376137)))

> system.time({gal <- readGAlignmentsList(fl, param=param) })
   user  system elapsed
  0.286   0.000   0.286
> length(gal)
[1] 4055
> table(mcols(gal)$mate_status)

    mated ambiguous   unmated
     3578         0       477


Regular coverage, range of width 147375:

> fl = untreated3_chr4()
> param = ScanBamParam(which = GRanges("chr4", IRanges(194, 147568)))
> system.time({gal <- readGAlignmentsList(fl, param=param) })
   user  system elapsed
  0.233   0.002   0.234 

ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by Valerie Obenchain ♦♦ 6.4k

Just edited my answer to make it more clear what file was being used ... since I named them both 'fl'.

Valerie

ADD REPLYlink written 2.2 years ago by Valerie Obenchain ♦♦ 6.4k

thanks Valerie and Martin! I will check it out and report back. this will really help speed up some scripts I'm working on.

ADD REPLYlink written 2.2 years ago by Michael Love14k

I can confirm this fixes the slow down on my end. Thanks again, this is really helpful for me.

ADD REPLYlink written 2.2 years ago by Michael Love14k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 132 users visited in the last hour