Question: Memory issues in summarizeOverlaps function
10 weeks ago, Diana wrote:

Hi all,

I get a memory error ('error: cannot allocate vector of size 344.5 Mb') when running summarizeOverlaps from the GenomicAlignments package. I have 4 GB RAM (with about 3.8 GB free) and I use 64-bit R. I also increased memory.limit to 3500 and tried --vanilla as well. Nothing seems to work. Do you have any ideas? Thanks a lot!

Answer: Memory issues in summarizeOverlaps function
10 weeks ago, James W. MacDonald (United States) wrote:

Assuming you are reading in data from BAM files, you should try reading the data in chunks. See ?BamFile, particularly the yieldSize argument, and the examples which show how it's used.
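
As a rough illustration of what chunked reading looks like (a sketch, not taken from the thread): the example below assumes the small ex1.bam file shipped with Rsamtools and an arbitrary yieldSize of 1000. Once the BamFile is open, each call to readGAlignments returns the next chunk, so only one chunk is held in memory at a time.

```r
library(Rsamtools)
library(GenomicAlignments)

# Assumption: example BAM file shipped with Rsamtools; replace with your own path.
bam_path <- system.file("extdata", "ex1.bam", package = "Rsamtools")

# yieldSize sets how many records are returned per read, not the total read.
bf <- BamFile(bam_path, yieldSize = 1000)

open(bf)
n_total <- 0
repeat {
  chunk <- readGAlignments(bf)      # next chunk of at most 1000 records
  if (length(chunk) == 0L) break    # an empty chunk means end of file
  n_total <- n_total + length(chunk)
}
close(bf)

n_total  # number of alignments seen across all chunks
```

A smaller yieldSize lowers peak memory use at the cost of more read calls.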

Hi James,

Thanks for your answer! Yes, I am reading BAM files. I know the yieldSize argument, but the file itself is only about 500 MB, so isn't the memory error a bit strange? What could explain it besides insufficient RAM (which is not the case)?

You say you are reading BAM files, but then you say 'the file itself', so it's not clear if you are reading in one or more files. Anyway, having a computer with 4 GB RAM doesn't mean you actually have that much RAM to allocate to R. It may be much less, depending on what else you have running. And reading in a 500 MB file will probably take more RAM than you would expect, given underlying copies that may be created. And if you are on Windows, which sometimes has problems releasing memory, that might be exacerbated.
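
To put the error message in perspective: the size in 'cannot allocate vector of size 344.5 Mb' is the single allocation that failed, on top of whatever R already holds. A rough base-R sketch of how large such a vector is, assuming R reports Mb as 1024^2 bytes:

```r
# Size reported in the error message, in Mb (assumed to mean 1024^2 bytes).
failed_mb <- 344.5
failed_bytes <- failed_mb * 1024^2

# Number of elements such a vector would hold, ignoring the small object header.
n_integers <- failed_bytes / 4  # integers take 4 bytes each
n_doubles  <- failed_bytes / 8  # doubles take 8 bytes each

print(format(c(integers = n_integers, doubles = n_doubles), big.mark = ","))

# object.size() reports how much memory an existing object occupies.
x <- integer(1e6)
print(object.size(x), units = "Mb")
```

Several vectors of that size, plus temporary copies, can exhaust a few GB quickly.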

I wouldn't use a Windows box with 4 GB RAM even for really basic stuff (16 GB RAM is about the lowest I would go, even for casual use), so it's not surprising to me at all that you would run out of RAM trying to do something real.

You say that you 'know the yieldSize argument'. Does that mean you are using it, or just that you know it exists?

Sorry, currently I am reading in one file. As for the yieldSize argument, I know it exists. I haven't tried it yet, as I assumed it would take a long time to read the whole file in separate chunks. I will try it with a yieldSize of 2000000 to start with.

> bfl <- BamFileList("../../data/star_aligned/303360Aligned.sortedByCoord.out.bam")
> system.time(summarizeOverlaps(ensex, bfl))
   user  system elapsed
233.280  34.764 268.371

> bfl <- BamFileList("../../data/star_aligned/303360Aligned.sortedByCoord.out.bam", yieldSize = 2e5)
> system.time(summarizeOverlaps(ensex, bfl))
   user  system elapsed
222.436   3.960 226.655


Hi, I still have one question about the reduceByYield function. I have the following code:

> csvfile <- file.path("W29-1-1.csv")
> sampleTable
File
1 W29-1-1-B
2 W29-1-1-F
> setwd("C:/Program Files/BAM files")
> filename <- file.path(paste0(sampleTable$File, "_aligned_genome_anonymized.sorted29.bam"))
> file.exists(filename)
[1] TRUE TRUE
> library("Rsamtools")
> library(GenomicFiles)
> library(GenomicFeatures)
> library(GenomicRanges)
> library("GenomicAlignments")
> library("BiocParallel")
> library("Rsamtools")
> bamfiles <- BamFileList(filename, yieldSize=2000000)
> x <- bamfiles
> reduceByYield(x, YIELD, MAP=identity, REDUCE='+', parallel=FALSE)


However, I get the following error:

Error in (function (classes, fdef, mtable)  :
  unable to find an inherited method for function ‘readGAlignments’ for signature ‘"BamFileList"’


My next steps are counting reads with summarizeOverlaps and performing a differential expression analysis with edgeR. This works fine with my current yieldSize of 2000000, but I want to perform these analyses on complete BAM files. Do you know how I can make reduceByYield work?

Why are you doing that? Simply passing a BamFileList to summarizeOverlaps where you have specified the yieldSize for the BamFileList will cause the data to be read in chunks.
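
For completeness, here is a hedged sketch of how reduceByYield could be called if you did want to use it. The error above is consistent with YIELD calling readGAlignments on the whole BamFileList, for which no method exists; reduceByYield works on a single file such as a BamFile. The example assumes the ex1.bam file shipped with Rsamtools and an arbitrary yieldSize, and the helper names (yield_chunk, count_chunk, count_one_file) are made up for illustration.

```r
library(GenomicFiles)
library(Rsamtools)
library(GenomicAlignments)

# Assumption: example BAM file shipped with Rsamtools; replace with real paths.
paths <- system.file("extdata", "ex1.bam", package = "Rsamtools")

# YIELD: read the next chunk of alignments from one open BamFile.
yield_chunk <- function(x, ...) readGAlignments(x)

# MAP: reduce each chunk to a single number (here, its number of alignments).
count_chunk <- function(value, ...) as.numeric(length(value))

# Apply reduceByYield to one file at a time, not to the BamFileList itself.
count_one_file <- function(path) {
  bf <- BamFile(path, yieldSize = 1000)
  reduceByYield(bf, yield_chunk, count_chunk, REDUCE = `+`)
}

total_alignments <- vapply(paths, count_one_file, numeric(1))
total_alignments
```

For plain read counting this is unnecessary, since summarizeOverlaps already iterates over chunks when the BamFileList has a yieldSize.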

Really? So simply running se will actually count all reads? That would be great... But how is it possible that tail(assay(se)) gives 9997 as the last row and rowRanges(se) gives an object of length 25892? I am sorry for asking these probably basic questions...

I think you might be confused. The row names for a SummarizedExperiment are the underlying IDs (which in your case might be Entrez Gene IDs?). The yieldSize argument simply sets the chunk size for the data being read in, not the total amount of data to read in:

> bams <- c("303301Aligned.sortedByCoord.out.bam","303362Aligned.sortedByCoord.out.bam")
> bfl <- BamFileList(bams)
> se_all <- summarizeOverlaps(ensex, bfl)
> bfl <- BamFileList(bams, yieldSize = 2e5)
> se_by_yield <- summarizeOverlaps(ensex, bfl)
> se_all
class: RangedSummarizedExperiment
dim: 225589 2
assays(1): counts
rownames(225589): ENSSSCG00000000002 ENSSSCG00000000002 ...
ENSSSCG00000040989 ENSSSCG00000040989
rowData names(0):
colnames(2): 303301Aligned.sortedByCoord.out.bam
303362Aligned.sortedByCoord.out.bam
colData names(0):
> se_by_yield
class: RangedSummarizedExperiment
dim: 225589 2
assays(1): counts
rownames(225589): ENSSSCG00000000002 ENSSSCG00000000002 ...
ENSSSCG00000040989 ENSSSCG00000040989
rowData names(0):
colnames(2): 303301Aligned.sortedByCoord.out.bam
303362Aligned.sortedByCoord.out.bam
colData names(0):


Please note that the dims of both SummarizedExperiments are identical, and that the rownames are (in this case) Ensembl Gene IDs.
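
A quick way to confirm this on your own data is to compare the count matrices directly, not just the dims. The sketch below is an illustration rather than part of the original reply: it assumes the ex1.bam file shipped with Rsamtools, an arbitrary yieldSize of 500, and two made-up features spanning that file's reference sequences.

```r
library(Rsamtools)
library(GenomicAlignments)

# Assumption: example BAM file shipped with Rsamtools; replace with your own files.
bam_path <- system.file("extdata", "ex1.bam", package = "Rsamtools")

# Hypothetical features covering the two reference sequences in ex1.bam.
features <- GRanges(c("seq1", "seq2"),
                    IRanges(start = c(1, 1), end = c(1575, 1584)))
names(features) <- c("featA", "featB")

# Same file, once read in a single pass and once in chunks of 500 records.
se_all     <- summarizeOverlaps(features, BamFileList(bam_path))
se_chunked <- summarizeOverlaps(features, BamFileList(bam_path, yieldSize = 500))

# Both the dimensions and the counts themselves should agree.
same_dim    <- identical(dim(se_all), dim(se_chunked))
same_counts <- identical(assay(se_all), assay(se_chunked))
c(same_dim = same_dim, same_counts = same_counts)
```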