Hey all,
I stumbled over the following thread again, since I misinterpreted the which argument in scanBamParam.
https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006985.html
I would expect from the 'which' argument only unique reads if I use multiple ranges overlapping or not. But the way it is implemented is the samtools way. For every requested range, the overlapping reads are extracted. Since in the post from 2015 the documentation was supposed to be updated to clarify this Im wondering where this is documented. I would have expected this in scanBamParam under @param which. Or am I mistaken? Just wanted to post this for future misunderstandings of other users.
https://www.rdocumentation.org/packages/Rsamtools/versions/1.24.0/topics/ScanBamParam
Would be there are a way to add a parameter to extract unique reads only? Otherwise I would need to load all reads and then remove duplicated ones, which is not memory efficient since I could load reads which I would dump afterwards again.
This is an updated version of the example code from the blog post to illustrate the issue:
library('GenomicAlignments')
## Code already included in ?readGAlignments
bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE)
which <- IRangesList(seq1=IRanges(1000, 2000), seq2=IRanges(c(100, 1000), c(1000, 2000)))
param <- ScanBamParam(which=which)
gal4 <- readGAlignments(bamfile, param=param)
gal4
## Note that if overlapping ranges are provided in 'which'
## reads could be selected more than once. This would artificually
## increase the coverage or affect other downstream results.
## If you 'reduce' the ranges, reads that originally overlapped
## two disjoint segments will be included.
which_dups <- IRangesList(seq1=rep(IRanges(1000, 2000), 2), seq2=IRanges(c(100, 1000), c(1000, 2000)))
param_dups <- ScanBamParam(which=which_dups)
param_reduced <- ScanBamParam(which=reduce(which_dups))
gal4_dups <- readGAlignments(bamfile, param=param_dups)
gal4_reduced <- readGAlignments(bamfile, param=param_reduced)
length(gal4)
## Duplicates some reads. In this case, all the ones between
## bases 1000 and 2000 on seq1.
length(gal4_dups)
## Includes some reads that mapped around base 1000 in seq2
## that were excluded in gal4.
length(gal4_reduced)
The session info:
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Scientific Linux 7.6 (Nitrogen)
Matrix products: default
BLAS: /data/nasif12/modules_if12/SL7/i12g/R/3.5.1-Bioc3.8/lib64/R/lib/libRblas.so
LAPACK: /data/nasif12/modules_if12/SL7/i12g/R/3.5.1-Bioc3.8/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C
[3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8
[5] LC_MONETARY=en_US.utf8 LC_MESSAGES=en_US.utf8
[7] LC_PAPER=en_US.utf8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] GenomicAlignments_1.18.1 Rsamtools_1.34.1
[3] Biostrings_2.50.2 XVector_0.22.0
[5] SummarizedExperiment_1.12.0 DelayedArray_0.8.0
[7] BiocParallel_1.16.6 matrixStats_0.54.0
[9] Biobase_2.42.0 GenomicRanges_1.34.0
[11] GenomeInfoDb_1.18.2 IRanges_2.16.0
[13] S4Vectors_0.20.1 BiocGenerics_0.28.0
loaded via a namespace (and not attached):
[1] lattice_0.20-38 bitops_1.0-6 grid_3.5.1
[4] zlibbioc_1.28.0 Matrix_1.2-16 tools_3.5.1
[7] RCurl_1.95-4.12 compiler_3.5.1 GenomeInfoDbData_1.2.0
I added a short comment to
?ScanBamParam
in version 1.99.4, which will be available in 'devel' in the next 24 - 48 hours; the updated documentation will be included in the next release, anticipated at the end of April.Note that the rdocumentation link is for Rsamtools version 1.24.0, so more than 5 releases (2.5 years) out of date.
Thanks Martin! I think this helps in the future to clarify what is parsed and extracted.