Search
Question: Memory-efficient method to calculate GC content of millions of GRanges
0
gravatar for jma1991
13 months ago by
jma199130
jma199130 wrote:

I need to calculate the GC content of windows across the entire mouse genome. The windows are not regularly spaced and are different sizes. In total there are 6,665,053 windows ranging from 1,239 bp to 3,102,000 bp. When I try to calculate the GC content of these windows my computer runs out of memory (16Gb RAM). Is there a more memory-efficient way than this approach?

> windowViews <- Views(BSgenome.Mmusculus.UCSC.mm10, windowRanges)
> gcFrequency <- letterFrequency(windowViews, letters="GC", as.prob=TRUE)

The variable windowRanges is a single GRanges object containing all the window ranges across the entire genome. If I split the ranges by chromosome, would this load/unload each chromosome? Similar to using bsapply to calculate the GC content of each chromsome?

> param <- new("BSParams", X=BSgenome.Mmusculus.UCSC.mm10 , FUN=letterFrequency)
> bsapply(param, letters="GC", as.prob=TRUE)
ADD COMMENTlink modified 13 months ago by Hervé Pagès ♦♦ 13k • written 13 months ago by jma199130

Right now letterFrequency,BSgenomeViews is loading all of the views into an XStringSet, which will consume a lot of memory. Iterating over the sequences does seem like a better strategy.
 

ADD REPLYlink written 13 months ago by Michael Lawrence9.8k
5
gravatar for Hervé Pagès
13 months ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:

Hi,

Iterating over the sequences should help. Another strategy is to process the BSgenomeViews object itself by chunk:

## A memory-efficient "letterFrequency" method for
## BSgenomeViews objects.
letterFrequency2 <- function(x, letters, OR="|",
                             as.prob=FALSE, ...)
{
    stopifnot(is(x, "BSgenomeViews"))
    chunksize <- 500000L
    chunks <- breakInChunks(length(x), chunksize)
    chunks <- as(chunks, "IRanges")
    ans_chunks <- lapply(seq_along(chunks),
        function(i) {
            x_chunk <- extractROWS(x, chunks[i])
            letterFrequency(x_chunk, letters, OR=OR,
                            as.prob=as.prob)
        })
    do.call(rbind, ans_chunks)
}

Then:

library(BSgenome.Mmusculus.UCSC.mm10)
windowViews <- Views(BSgenome.Mmusculus.UCSC.mm10, windowRanges)
gcFrequency <- letterFrequency2(windowViews, letters="GC", as.prob=TRUE)

At some point I should improve the current implementation of the "letterFrequency" method for BSgenomeViews objects along this idea.

Cheers,

H.

ADD COMMENTlink modified 13 months ago • written 13 months ago by Hervé Pagès ♦♦ 13k

Thank you for the reply. One question, does creating a BSgenome views object not also load the entire genome into memory?

ADD REPLYlink written 12 months ago by jma199130
1

Hi,

Creating the BSgenomeViews object doesn't load anything from disk. And displaying it only loads the content of the views that are being displayed.

Note that the above implementation of letterFrequency2 loads the content of 500000 views at a time. You can change this by modifying the value of chunksize. I'm thinking of making this a parameter that is stored in the BSgenomeViews object itself and that the user would specify at construction time. There would also be a chunksize getter and setter to let the user access the value and change it afterwards.

Cheers,

H.

ADD REPLYlink written 12 months ago by Hervé Pagès ♦♦ 13k

Great, thanks for your reply.

ADD REPLYlink written 12 months ago by jma199130
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 167 users visited in the last hour