Question: Duplicated reads if 'which' has overlapping reads in ScanBamParam
0
gravatar for mertes
8 months ago by
mertes0
mertes0 wrote:

Hey all,

I stumbled over the following thread again, since I misinterpreted the which argument in scanBamParam.

https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006985.html

I would expect from the 'which' argument only unique reads if I use multiple ranges overlapping or not. But the way it is implemented is the samtools way. For every requested range, the overlapping reads are extracted. Since in the post from 2015 the documentation was supposed to be updated to clarify this Im wondering where this is documented. I would have expected this in scanBamParam under @param which. Or am I mistaken? Just wanted to post this for future misunderstandings of other users.

https://www.rdocumentation.org/packages/Rsamtools/versions/1.24.0/topics/ScanBamParam

Would be there are a way to add a parameter to extract unique reads only? Otherwise I would need to load all reads and then remove duplicated ones, which is not memory efficient since I could load reads which I would dump afterwards again.

This is an updated version of the example code from the blog post to illustrate the issue:

library('GenomicAlignments')

## Code already included in ?readGAlignments
bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE)
which <- IRangesList(seq1=IRanges(1000, 2000), seq2=IRanges(c(100, 1000), c(1000, 2000)))
param <- ScanBamParam(which=which)
gal4 <- readGAlignments(bamfile, param=param)
gal4

## Note that if overlapping ranges are provided in 'which'
## reads could be selected more than once. This would artificually
## increase the coverage or affect other downstream results.
## If you 'reduce' the ranges, reads that originally overlapped
## two disjoint segments will be included.
which_dups <- IRangesList(seq1=rep(IRanges(1000, 2000), 2), seq2=IRanges(c(100, 1000), c(1000, 2000)))
param_dups <- ScanBamParam(which=which_dups)
param_reduced <- ScanBamParam(which=reduce(which_dups))
gal4_dups <- readGAlignments(bamfile, param=param_dups)
gal4_reduced <- readGAlignments(bamfile, param=param_reduced)

length(gal4)

## Duplicates some reads. In this case, all the ones between
## bases 1000 and 2000 on seq1.
length(gal4_dups)

## Includes some reads that mapped around base 1000 in seq2
## that were excluded in gal4.
length(gal4_reduced)

The session info:

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Scientific Linux 7.6 (Nitrogen)

Matrix products: default
BLAS: /data/nasif12/modules_if12/SL7/i12g/R/3.5.1-Bioc3.8/lib64/R/lib/libRblas.so
LAPACK: /data/nasif12/modules_if12/SL7/i12g/R/3.5.1-Bioc3.8/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C
 [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8
 [5] LC_MONETARY=en_US.utf8    LC_MESSAGES=en_US.utf8
 [7] LC_PAPER=en_US.utf8       LC_NAME=C
 [9] LC_ADDRESS=C              LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] GenomicAlignments_1.18.1    Rsamtools_1.34.1
 [3] Biostrings_2.50.2           XVector_0.22.0
 [5] SummarizedExperiment_1.12.0 DelayedArray_0.8.0
 [7] BiocParallel_1.16.6         matrixStats_0.54.0
 [9] Biobase_2.42.0              GenomicRanges_1.34.0
[11] GenomeInfoDb_1.18.2         IRanges_2.16.0
[13] S4Vectors_0.20.1            BiocGenerics_0.28.0

loaded via a namespace (and not attached):
[1] lattice_0.20-38        bitops_1.0-6           grid_3.5.1
[4] zlibbioc_1.28.0        Matrix_1.2-16          tools_3.5.1
[7] RCurl_1.95-4.12        compiler_3.5.1         GenomeInfoDbData_1.2.0
ADD COMMENTlink written 8 months ago by mertes0

I added a short comment to ?ScanBamParam in version 1.99.4, which will be available in 'devel' in the next 24 - 48 hours; the updated documentation will be included in the next release, anticipated at the end of April.

Note that the rdocumentation link is for Rsamtools version 1.24.0, so more than 5 releases (2.5 years) out of date.

ADD REPLYlink written 8 months ago by Martin Morgan ♦♦ 24k

Thanks Martin! I think this helps in the future to clarify what is parsed and extracted.

ADD REPLYlink written 8 months ago by mertes0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 167 users visited in the last hour