Hi all,
I ran into performance issues when using readGAlignmentPairs with a ‘which’ argument (passed via a ScanBamParam object).
I’m working with a BAM file with paired alignments that can be read with readGAlignmentPairs in ~15 minutes. The function call results in a warning message regarding 30 out-of-bound ranges detected on the mitochondrial chromosome (see below).
When using a which argument including the mitochondrial chromosome only, the function takes at least several hours (I didn’t wait for it to finish). I assume this is related to the out-of-bound ranges. Is there a way to avoid the long run-time when using the which argument? I can provide test data if required.
Many thanks
Leonard
> system.time(x <- readGAlignmentPairs(file = file_bam))
user system elapsed
762.898 83.143 849.589
Warning messages:
1: In GenomicRanges:::valid.GenomicRanges.seqinfo(x) :
GAlignments object contains 30 out-of-bound ranges located on sequence
MT. Note that only ranges located on a non-circular sequence whose
length is not NA can be considered out-of-bound (use seqlengths() and
isCircular() to get the lengths and circularity flags of the underlying
sequences).
2: In .make_GAlignmentPairs_from_GAlignments(gal, use.mcols = use.mcols) :
48218 pairs (0 proper, 48218 not proper) were dropped because the seqname
or strand of the alignments in the pair were not concordant.
Note that a GAlignmentPairs object can only hold concordant pairs at the
moment, that is, pairs where the 2 alignments are on the opposite strands
of the same chromosome.
>
> sessionInfo()
R Under development (unstable) (2014-11-04 r66932)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] GenomicAlignments_1.3.27 Rsamtools_1.19.26 Biostrings_2.35.7
[4] XVector_0.7.3 SGSeq_1.1.17 GenomicRanges_1.19.35
[7] GenomeInfoDb_1.3.12 IRanges_2.1.36 S4Vectors_0.5.17
[10] BiocGenerics_0.13.4 BiocParallel_1.1.13
loaded via a namespace (and not attached):
[1] AnnotationDbi_1.29.17 base64enc_0.1-2 BatchJobs_1.5
[4] BBmisc_1.8 Biobase_2.27.1 biomaRt_2.23.5
[7] bitops_1.0-6 brew_1.0-6 checkmate_1.5.1
[10] codetools_0.2-10 DBI_0.3.1 digest_0.6.8
[13] fail_1.2 foreach_1.4.2 GenomicFeatures_1.19.17
[16] igraph_0.7.1 iterators_1.0.7 RCurl_1.95-4.5
[19] RSQLite_1.0.0 rtracklayer_1.27.7 sendmailR_1.2-1
[22] stringr_0.6.2 tools_3.2.0 XML_3.98-1.1
[25] zlibbioc_1.13.0
>
Hi Leonard,
readGAlignmentPairs()
callsreadGAlignments()
which itself callsscanBam()
and theparam
arg is passed all the way down toscanBam()
.readGAlignmentPairs()
andreadGAlignments()
don't do anything special with it. Can you reproduce this with a direct call toscanBam()
, after opening the BamFile withasMates=TRUE
? You could use something like:Then call
scanBam()
onbamfile
with and without thewhich
argument.We could do this for you if we had access to your file (which we're going to need anyway at some point I guess).
Thanks,
H.