large memory footprint with summarizeOverlaps method for BamViews
alex.gos90 ▴ 10
@alexgos90-13597
Germany

Hello,

I would like to point out that when summarizeOverlaps is called on a BamViews object that was defined with specific bamRanges, the whole BAM file is loaded into memory unless the param argument is provided explicitly.

Here is an example:

library(GenomicAlignments)
tiny_bam <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE)
fl <- c(tiny_bam,tiny_bam)
rngs <- GRanges(c("seq1", "seq2"), IRanges(1, c(15, 15)))
samp <- DataFrame(info=c("ex1","ex2"), row.names=c("ex1","ex2"))

# define the BamViews for multiple files using Rsamtools
view <- BamViews(bamPaths = fl, bamSamples=samp, bamRanges=rngs)


So the following two calls have different memory footprints, because the first one loads the whole BAM file,

se <- summarizeOverlaps(view, mode=Union, ignore.strand=TRUE)


while the second one loads only the reads that overlap the given ranges.

se <- summarizeOverlaps(view,
                        mode=Union,
                        ignore.strand=TRUE,
                        param=ScanBamParam(which=rngs))
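
To give a rough idea of the difference, here is a minimal sketch that reads the same example file with and without a `which` restriction and compares what ends up in memory. It uses readGAlignments directly on the file rather than summarizeOverlaps, so it only approximates what happens internally; the variable names `all_reads` and `in_ranges` are mine.

```r
library(GenomicAlignments)

tiny_bam <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE)
rngs <- GRanges(c("seq1", "seq2"), IRanges(1, c(15, 15)))

# No 'which': every alignment in the file is read into memory
all_reads <- readGAlignments(tiny_bam)

# With 'which': only alignments overlapping the two ranges are read
in_ranges <- readGAlignments(tiny_bam, param=ScanBamParam(which=rngs))

# Compare the number of alignments and the size of the resulting objects
length(all_reads)
length(in_ranges)
print(object.size(all_reads), units="Kb")
print(object.size(in_ranges), units="Kb")
```

On this tiny file the difference is negligible in absolute terms, but the same ratio on a real BAM file can decide whether the call fits into memory.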


I saw in the source code of the readGAlignments method for BamViews (https://github.com/Bioconductor/GenomicAlignments/blob/master/R/readGAlignments.R#L138-L159) that the ScanBamParam can be updated internally using the bamRanges() of the BamViews object. Doing the same in summarizeOverlaps would remove the need to provide the ranges a second time through the param argument.
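
Until something like this is available, a user-side workaround could look like the sketch below. The wrapper `summarizeOverlapsInRanges` is a hypothetical helper of mine, not part of GenomicAlignments, and it assumes that an empty `bamWhich(param)` means the caller did not ask for specific ranges.

```r
library(GenomicAlignments)

# Hypothetical helper (not in GenomicAlignments): if the ScanBamParam has no
# 'which' set, fill it in from the ranges stored in the BamViews object.
summarizeOverlapsInRanges <- function(view, ..., param=ScanBamParam()) {
    if (length(bamWhich(param)) == 0L)
        bamWhich(param) <- bamRanges(view)
    summarizeOverlaps(view, ..., param=param)
}

# Same setup as in the example above
tiny_bam <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE)
fl <- c(tiny_bam, tiny_bam)
rngs <- GRanges(c("seq1", "seq2"), IRanges(1, c(15, 15)))
samp <- DataFrame(info=c("ex1", "ex2"), row.names=c("ex1", "ex2"))
view <- BamViews(bamPaths=fl, bamSamples=samp, bamRanges=rngs)

# The ranges no longer have to be passed a second time
se <- summarizeOverlapsInRanges(view, mode=Union, ignore.strand=TRUE)
assay(se)
```

An explicitly supplied `which` is left untouched here, so the caller can still override the ranges when needed.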

I think this would improve the usability of the function, and I just wanted to let the developers of the very good GenomicAlignments package know.

Best,

Alex