I need to calculate the GC content of windows across the entire mouse genome. The windows are not regularly spaced and vary in size. In total there are 6,665,053 windows ranging from 1,239 bp to 3,102,000 bp. When I try to calculate the GC content of these windows, my computer (16 GB RAM) runs out of memory. Is there a more memory-efficient way than this approach?
> windowViews <- Views(BSgenome.Mmusculus.UCSC.mm10, windowRanges)
> gcFrequency <- letterFrequency(windowViews, letters="GC", as.prob=TRUE)
The variable windowRanges is a single GRanges object containing all the window ranges across the entire genome. If I split the ranges by chromosome, would this load and unload each chromosome in turn, similar to using bsapply to calculate the GC content of each chromosome?
> param <- new("BSParams", X=BSgenome.Mmusculus.UCSC.mm10, FUN=letterFrequency)
> bsapply(param, letters="GC", as.prob=TRUE)
Right now letterFrequency,BSgenomeViews is loading all of the views into an XStringSet, which will consume a lot of memory. Iterating over the sequences does seem like a better strategy.
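One way to iterate over the sequences is sketched below. It is only a sketch: gcByChromosome is a hypothetical helper, not part of BSgenome or Biostrings, and it assumes that genome[[chr]] returns a single DNAString (which holds for a BSgenome object and for a named DNAStringSet). Only one chromosome is referenced at a time, so peak memory should stay close to the size of the largest chromosome, although I have not measured this on the full set of windows.

```r
library(Biostrings)
library(GenomicRanges)

# Hypothetical helper: GC fraction of each window, holding one chromosome
# in memory at a time. Results are returned in the order of `windows`.
gcByChromosome <- function(genome, windows) {
    gc <- numeric(length(windows))
    # Positions of the windows, grouped by the chromosome they lie on
    idxByChr <- split(seq_along(windows), as.character(seqnames(windows)))
    for (chr in names(idxByChr)) {
        idx <- idxByChr[[chr]]
        chrSeq <- genome[[chr]]    # load a single chromosome as a DNAString
        # Views on one DNAString reference the sequence without copying it
        chrViews <- Views(chrSeq, ranges(windows)[idx])
        gc[idx] <- letterFrequency(chrViews, letters = "GC", as.prob = TRUE)[, 1]
    }
    gc
}

# Intended usage with the objects from the question:
# gcFrequency <- gcByChromosome(BSgenome.Mmusculus.UCSC.mm10, windowRanges)
```

Strand is ignored here, which does not matter for GC content because G and C are counted together.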