Search
Question: Memory-efficient method to calculate GC content of millions of GRanges
0
18 months ago by
jma199130
jma199130 wrote:

I need to calculate the GC content of windows across the entire mouse genome. The windows are not regularly spaced and are different sizes. In total there are 6,665,053 windows ranging from 1,239 bp to 3,102,000 bp. When I try to calculate the GC content of these windows my computer runs out of memory (16Gb RAM). Is there a more memory-efficient way than this approach?

> windowViews <- Views(BSgenome.Mmusculus.UCSC.mm10, windowRanges)
> gcFrequency <- letterFrequency(windowViews, letters="GC", as.prob=TRUE)

The variable windowRanges is a single GRanges object containing all the window ranges across the entire genome. If I split the ranges by chromosome, would this load/unload each chromosome? Similar to using bsapply to calculate the GC content of each chromsome?

> param <- new("BSParams", X=BSgenome.Mmusculus.UCSC.mm10 , FUN=letterFrequency)
> bsapply(param, letters="GC", as.prob=TRUE)
modified 18 months ago by Hervé Pagès ♦♦ 13k • written 18 months ago by jma199130

Right now letterFrequency,BSgenomeViews is loading all of the views into an XStringSet, which will consume a lot of memory. Iterating over the sequences does seem like a better strategy.

5
18 months ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:

Hi,

Iterating over the sequences should help. Another strategy is to process the BSgenomeViews object itself by chunk:

## A memory-efficient "letterFrequency" method for
## BSgenomeViews objects.
letterFrequency2 <- function(x, letters, OR="|",
as.prob=FALSE, ...)
{
stopifnot(is(x, "BSgenomeViews"))
chunksize <- 500000L
chunks <- breakInChunks(length(x), chunksize)
chunks <- as(chunks, "IRanges")
ans_chunks <- lapply(seq_along(chunks),
function(i) {
x_chunk <- extractROWS(x, chunks[i])
letterFrequency(x_chunk, letters, OR=OR,
as.prob=as.prob)
})
do.call(rbind, ans_chunks)
}


Then:

library(BSgenome.Mmusculus.UCSC.mm10)
windowViews <- Views(BSgenome.Mmusculus.UCSC.mm10, windowRanges)
gcFrequency <- letterFrequency2(windowViews, letters="GC", as.prob=TRUE)

At some point I should improve the current implementation of the "letterFrequency" method for BSgenomeViews objects along this idea.

Cheers,

H.

Thank you for the reply. One question, does creating a BSgenome views object not also load the entire genome into memory?

1

Hi,

Creating the BSgenomeViews object doesn't load anything from disk. And displaying it only loads the content of the views that are being displayed.

Note that the above implementation of letterFrequency2 loads the content of 500000 views at a time. You can change this by modifying the value of chunksize. I'm thinking of making this a parameter that is stored in the BSgenomeViews object itself and that the user would specify at construction time. There would also be a chunksize getter and setter to let the user access the value and change it afterwards.

Cheers,

H.