Duplicated reads if 'which' has overlapping reads in ScanBamParam
0
0
Entering edit mode
@mertes-11992
Last seen 21 months ago
Germany

Hey all,

I stumbled over the following thread again, since I misinterpreted the which argument in scanBamParam.

https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006985.html

I would expect from the 'which' argument only unique reads if I use multiple ranges overlapping or not. But the way it is implemented is the samtools way. For every requested range, the overlapping reads are extracted. Since in the post from 2015 the documentation was supposed to be updated to clarify this Im wondering where this is documented. I would have expected this in scanBamParam under @param which. Or am I mistaken? Just wanted to post this for future misunderstandings of other users.

https://www.rdocumentation.org/packages/Rsamtools/versions/1.24.0/topics/ScanBamParam

Would be there are a way to add a parameter to extract unique reads only? Otherwise I would need to load all reads and then remove duplicated ones, which is not memory efficient since I could load reads which I would dump afterwards again.

This is an updated version of the example code from the blog post to illustrate the issue:

library('GenomicAlignments')

## Code already included in ?readGAlignments
bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE)
which <- IRangesList(seq1=IRanges(1000, 2000), seq2=IRanges(c(100, 1000), c(1000, 2000)))
param <- ScanBamParam(which=which)
gal4 <- readGAlignments(bamfile, param=param)
gal4

## Note that if overlapping ranges are provided in 'which'
## reads could be selected more than once. This would artificually
## increase the coverage or affect other downstream results.
## If you 'reduce' the ranges, reads that originally overlapped
## two disjoint segments will be included.
which_dups <- IRangesList(seq1=rep(IRanges(1000, 2000), 2), seq2=IRanges(c(100, 1000), c(1000, 2000)))
param_dups <- ScanBamParam(which=which_dups)
param_reduced <- ScanBamParam(which=reduce(which_dups))
gal4_dups <- readGAlignments(bamfile, param=param_dups)
gal4_reduced <- readGAlignments(bamfile, param=param_reduced)

length(gal4)

## Duplicates some reads. In this case, all the ones between
## bases 1000 and 2000 on seq1.
length(gal4_dups)

## Includes some reads that mapped around base 1000 in seq2
## that were excluded in gal4.
length(gal4_reduced)

The session info:

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Scientific Linux 7.6 (Nitrogen)

Matrix products: default
BLAS: /data/nasif12/modules_if12/SL7/i12g/R/3.5.1-Bioc3.8/lib64/R/lib/libRblas.so
LAPACK: /data/nasif12/modules_if12/SL7/i12g/R/3.5.1-Bioc3.8/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C
 [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8
 [5] LC_MONETARY=en_US.utf8    LC_MESSAGES=en_US.utf8
 [7] LC_PAPER=en_US.utf8       LC_NAME=C
 [9] LC_ADDRESS=C              LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] GenomicAlignments_1.18.1    Rsamtools_1.34.1
 [3] Biostrings_2.50.2           XVector_0.22.0
 [5] SummarizedExperiment_1.12.0 DelayedArray_0.8.0
 [7] BiocParallel_1.16.6         matrixStats_0.54.0
 [9] Biobase_2.42.0              GenomicRanges_1.34.0
[11] GenomeInfoDb_1.18.2         IRanges_2.16.0
[13] S4Vectors_0.20.1            BiocGenerics_0.28.0

loaded via a namespace (and not attached):
[1] lattice_0.20-38        bitops_1.0-6           grid_3.5.1
[4] zlibbioc_1.28.0        Matrix_1.2-16          tools_3.5.1
[7] RCurl_1.95-4.12        compiler_3.5.1         GenomeInfoDbData_1.2.0
documentation Rsamtools GenomicAlignments • 1.1k views
ADD COMMENT
0
Entering edit mode

I added a short comment to ?ScanBamParam in version 1.99.4, which will be available in 'devel' in the next 24 - 48 hours; the updated documentation will be included in the next release, anticipated at the end of April.

Note that the rdocumentation link is for Rsamtools version 1.24.0, so more than 5 releases (2.5 years) out of date.

ADD REPLY
0
Entering edit mode

Thanks Martin! I think this helps in the future to clarify what is parsed and extracted.

ADD REPLY

Login before adding your answer.

Traffic: 860 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6