Question

scanBam with ScanBamParam argument which specified ignores yieldSize. Intentionally?

0

Entering edit mode

Johannes Rainer ★ 2.1k

@johannes-rainer-6987

Last seen 5 months ago

Italy

Dear all!

For performance reasons I wanted to extract only a limited set of aligned reads from a specified genomic region and thus I set the yieldSize of the BamFile to e.g. 5000. scanBam however ignores that yieldSize completely and returns all of the reads in the genomic region specified with the argument which of ScanBamParam. Is this intended?

The example code I run:

> library(Rsamtools)
> myBf <- BamFile(bf, index=paste0(bf, ".bai"), yieldSize=5000, asMates=TRUE)
> myParam <- ScanBamParam(what=scanBamWhat(), tag="MD", flag=scanBamFlag(isUnmappedQuery=FALSE))
> Test <- scanBam(myBf, param=myParam)
> length(Test[[1]]$seq)
[1] 10000

## Running the same with which:
> myParam <- ScanBamParam(what=scanBamWhat(), tag="MD", flag=scanBamFlag(isUnmappedQuery=FALSE), which=GRanges("11", ranges=IRanges(69641087, 69654474)))
> Test <- scanBam(myBf, param=myParam)
> length(Test[[1]]$seq)
[1] 146708

Thanks for any input

cheers, jo

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin15.0.0/x86_64 (64-bit)
Running under: OS X 10.11.2 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
[1] Rsamtools_1.22.0     Biostrings_2.38.1    XVector_0.10.0      
[4] GenomicRanges_1.22.1 GenomeInfoDb_1.6.1   IRanges_2.4.1       
[7] S4Vectors_0.8.2      BiocGenerics_0.16.1

loaded via a namespace (and not attached):
[1] zlibbioc_1.16.0      futile.logger_1.4.1  lambda.r_1.1.7      
[4] futile.options_1.0.0 BiocParallel_1.4.0   bitops_1.0-6

scanbam scanbamparam rsamtools • 2.0k views

ADD COMMENT • link updated 9.4 years ago by Martin Morgan 25k • written 9.4 years ago by Johannes Rainer ★ 2.1k

0

Entering edit mode

TimothéeFlutre ▴ 80

@timotheeflutre-6727

Last seen 5.9 years ago

France

From what I understand, yieldSize=5000 doesn't mean "read only the first 5000 records" but "read all records by batch, with 5000 records per batch". It helps when loading all records in memory at once would be too big.

ADD COMMENT • link 9.4 years ago TimothéeFlutre ▴ 80

score 1 · Accepted Answer · 2015-11-18

From ?BamFile

     yieldSize: Number of records to yield each time the file is read
          from using 'scanBam' or, when 'length(bamWhich()) != 0', a
          threshold which yields records in complete ranges whose sum
          first exceeds 'yieldSize'.

So yes, the behavior you see is intentional. For many ranges, it is often more efficient to iterate through the entire file and subsetByOverlaps(), e.g., using GenomicFiles::reduceByYield() and readGAlignments().