I noticed that readGAlignmentPairs() is slow to read in some alignments for a gene when I provide the range of the gene as 'which' in the ScanBamParam. The gene is covered by some (likely) misaligned reads, which contain a gap that spans the entire range of the gene of interest. I've pre-filtered out an additional set of alignments which had quality score < 4, and tried various flags, but it's still slow to process the alignments as pairs. I have a subset BAM file (5Mb) which reproduces the problem which I will email to maintainer [at] bioc. The genome is hg19.
> system.time({ ga <- readGAlignments("subset.bam") })
user system elapsed
0.493 0.012 0.510
> length(ga)
[1] 95502
> system.time({ ga <- readGAlignmentPairs("subset.bam") })
user system elapsed
1.870 0.014 1.887
Warning message:
In .make_GAlignmentPairs_from_GAlignments(gal, strandMode = strandMode, :
6 pairs (0 proper, 6 not proper) were dropped because the seqname
or strand of the alignments in the pair were not concordant.
Note that a GAlignmentPairs object can only hold concordant pairs at the
moment, that is, pairs where the 2 alignments are on the opposite strands
of the same reference sequence.
Calls: system.time ... readGAlignmentPairs -> .make_GAlignmentPairs_from_GAlignments
> length(ga)
[1] 38579
> system.time({ ga <- readGAlignmentPairs("subset.bam", param=ScanBamParam(which=GRanges("chr9",IRanges(19228763,19376137)))) })
user system elapsed
177.091 2.380 179.525
Warning message:
In .make_GAlignmentPairs_from_GAlignments(gal, strandMode = strandMode, :
1 pairs (0 proper, 1 not proper) were dropped because the seqname
or strand of the alignments in the pair were not concordant.
Note that a GAlignmentPairs object can only hold concordant pairs at the
moment, that is, pairs where the 2 alignments are on the opposite strands
of the same reference sequence.
Calls: system.time ... readGAlignmentPairs -> .make_GAlignmentPairs_from_GAlignments
> system.time({ ga <- readGAlignmentPairs("subset.bam", param=ScanBamParam(which=GRanges("chr9",IRanges(19228763,19376137)), flag=scanBamFlag(isProperPair=TRUE, isSecondaryAlignment=FALSE))) })
user system elapsed
148.353 2.260 150.659
> length(ga)
[1] 3190
> sessionInfo()
R Under development (unstable) (2015-06-25 r68588)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: CentOS release 6.5 (Final)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices datasets utils
[8] methods base
other attached packages:
[1] GenomicAlignments_1.5.12 Rsamtools_1.21.14
[3] Biostrings_2.37.2 XVector_0.9.1
[5] SummarizedExperiment_0.3.2 Biobase_2.29.1
[7] GenomicRanges_1.21.17 GenomeInfoDb_1.5.9
[9] IRanges_2.3.17 S4Vectors_0.7.12
[11] BiocGenerics_0.15.5 BiocInstaller_1.19.9
loaded via a namespace (and not attached):
[1] Rcpp_0.11.6 bitops_1.0-6 futile.options_1.0.0
[4] git2r_0.10.1 zlibbioc_1.15.0 futile.logger_1.4.1
[7] lambda.r_1.1.7 BiocParallel_1.3.47 tools_3.3.0
>

Hi Michael,
Yes, making the file available somewhere or sending it to maintainer AT bioconductor.org would be helpful. Thanks!
H.