scanBam with ScanBamParam argument which specified ignores yieldSize. Intentionally?
2
0
Entering edit mode
Johannes Rainer ★ 2.0k
@johannes-rainer-6987
Last seen 4 weeks ago
Italy

Dear all!

For performance reasons I wanted to extract only a limited set of aligned reads from a specified genomic region and thus I set the yieldSize of the BamFile to e.g. 5000. scanBam however ignores that yieldSize completely and returns all of the reads in the genomic region specified with the argument which of ScanBamParam. Is this intended?

The example code I run:

> library(Rsamtools)
> myBf <- BamFile(bf, index=paste0(bf, ".bai"), yieldSize=5000, asMates=TRUE)
> myParam <- ScanBamParam(what=scanBamWhat(), tag="MD", flag=scanBamFlag(isUnmappedQuery=FALSE))
> Test <- scanBam(myBf, param=myParam)
> length(Test[[1]]$seq)
[1] 10000

## Running the same with which:
> myParam <- ScanBamParam(what=scanBamWhat(), tag="MD", flag=scanBamFlag(isUnmappedQuery=FALSE), which=GRanges("11", ranges=IRanges(69641087, 69654474)))
> Test <- scanBam(myBf, param=myParam)
> length(Test[[1]]$seq)
[1] 146708

 

Thanks for any input

cheers, jo

 

 

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin15.0.0/x86_64 (64-bit)
Running under: OS X 10.11.2 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
[1] Rsamtools_1.22.0     Biostrings_2.38.1    XVector_0.10.0      
[4] GenomicRanges_1.22.1 GenomeInfoDb_1.6.1   IRanges_2.4.1       
[7] S4Vectors_0.8.2      BiocGenerics_0.16.1

loaded via a namespace (and not attached):
[1] zlibbioc_1.16.0      futile.logger_1.4.1  lambda.r_1.1.7      
[4] futile.options_1.0.0 BiocParallel_1.4.0   bitops_1.0-6        

 

scanbam scanbamparam rsamtools • 1.7k views
ADD COMMENT
1
Entering edit mode
@martin-morgan-1513
Last seen 16 days ago
United States

From ?BamFile

     yieldSize: Number of records to yield each time the file is read
          from using 'scanBam' or, when 'length(bamWhich()) != 0', a
          threshold which yields records in complete ranges whose sum
          first exceeds 'yieldSize'.

So yes, the behavior you see is intentional.  For many ranges, it is often more efficient to iterate through the entire file and subsetByOverlaps(), e.g., using GenomicFiles::reduceByYield() and readGAlignments().

ADD COMMENT
0
Entering edit mode
@timotheeflutre-6727
Last seen 5.0 years ago
France

From what I understand, yieldSize=5000 doesn't mean "read only the first 5000 records" but "read all records by batch, with 5000 records per batch". It helps when loading all records in memory at once would be too big.

ADD COMMENT

Login before adding your answer.

Traffic: 839 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6