I stumbled over the following thread again, since I misinterpreted the which argument in scanBamParam.
I would expect from the 'which' argument only unique reads if I use multiple ranges overlapping or not. But the way it is implemented is the samtools way. For every requested range, the overlapping reads are extracted. Since in the post from 2015 the documentation was supposed to be updated to clarify this Im wondering where this is documented. I would have expected this in scanBamParam under @param which. Or am I mistaken? Just wanted to post this for future misunderstandings of other users.
Would be there are a way to add a parameter to extract unique reads only? Otherwise I would need to load all reads and then remove duplicated ones, which is not memory efficient since I could load reads which I would dump afterwards again.
This is an updated version of the example code from the blog post to illustrate the issue:
library('GenomicAlignments') ## Code already included in ?readGAlignments bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE) which <- IRangesList(seq1=IRanges(1000, 2000), seq2=IRanges(c(100, 1000), c(1000, 2000))) param <- ScanBamParam(which=which) gal4 <- readGAlignments(bamfile, param=param) gal4 ## Note that if overlapping ranges are provided in 'which' ## reads could be selected more than once. This would artificually ## increase the coverage or affect other downstream results. ## If you 'reduce' the ranges, reads that originally overlapped ## two disjoint segments will be included. which_dups <- IRangesList(seq1=rep(IRanges(1000, 2000), 2), seq2=IRanges(c(100, 1000), c(1000, 2000))) param_dups <- ScanBamParam(which=which_dups) param_reduced <- ScanBamParam(which=reduce(which_dups)) gal4_dups <- readGAlignments(bamfile, param=param_dups) gal4_reduced <- readGAlignments(bamfile, param=param_reduced) length(gal4) ## Duplicates some reads. In this case, all the ones between ## bases 1000 and 2000 on seq1. length(gal4_dups) ## Includes some reads that mapped around base 1000 in seq2 ## that were excluded in gal4. length(gal4_reduced)
The session info:
> sessionInfo() R version 3.5.1 (2018-07-02) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Scientific Linux 7.6 (Nitrogen) Matrix products: default BLAS: /data/nasif12/modules_if12/SL7/i12g/R/3.5.1-Bioc3.8/lib64/R/lib/libRblas.so LAPACK: /data/nasif12/modules_if12/SL7/i12g/R/3.5.1-Bioc3.8/lib64/R/lib/libRlapack.so locale:  LC_CTYPE=en_US.utf8 LC_NUMERIC=C  LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8  LC_MONETARY=en_US.utf8 LC_MESSAGES=en_US.utf8  LC_PAPER=en_US.utf8 LC_NAME=C  LC_ADDRESS=C LC_TELEPHONE=C  LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C attached base packages:  stats4 parallel stats graphics grDevices utils datasets  methods base other attached packages:  GenomicAlignments_1.18.1 Rsamtools_1.34.1  Biostrings_2.50.2 XVector_0.22.0  SummarizedExperiment_1.12.0 DelayedArray_0.8.0  BiocParallel_1.16.6 matrixStats_0.54.0  Biobase_2.42.0 GenomicRanges_1.34.0  GenomeInfoDb_1.18.2 IRanges_2.16.0  S4Vectors_0.20.1 BiocGenerics_0.28.0 loaded via a namespace (and not attached):  lattice_0.20-38 bitops_1.0-6 grid_3.5.1  zlibbioc_1.28.0 Matrix_1.2-16 tools_3.5.1  RCurl_1.95-4.12 compiler_3.5.1 GenomeInfoDbData_1.2.0