Question

Memory-efficient method to calculate GC content of millions of GRanges

0

Entering edit mode

jma1991 ▴ 70

@jma1991-11856

Last seen 2.9 years ago

Cumbernauld

I need to calculate the GC content of windows across the entire mouse genome. The windows are not regularly spaced and are different sizes. In total there are 6,665,053 windows ranging from 1,239 bp to 3,102,000 bp. When I try to calculate the GC content of these windows my computer runs out of memory (16Gb RAM). Is there a more memory-efficient way than this approach?

> windowViews <- Views(BSgenome.Mmusculus.UCSC.mm10, windowRanges)
> gcFrequency <- letterFrequency(windowViews, letters="GC", as.prob=TRUE)

The variable windowRanges is a single GRanges object containing all the window ranges across the entire genome. If I split the ranges by chromosome, would this load/unload each chromosome? Similar to using bsapply to calculate the GC content of each chromsome?

> param <- new("BSParams", X=BSgenome.Mmusculus.UCSC.mm10 , FUN=letterFrequency)
> bsapply(param, letters="GC", as.prob=TRUE)

granges biostrings bsapply BSgenome • 3.1k views

ADD COMMENT • link updated 9.2 years ago by Hervé Pagès 16k • written 9.2 years ago by jma1991 ▴ 70

0

Entering edit mode

Right now letterFrequency,BSgenomeViews is loading all of the views into an XStringSet, which will consume a lot of memory. Iterating over the sequences does seem like a better strategy.

ADD REPLY • link 9.2 years ago Michael Lawrence ★ 11k

score 5 · Accepted Answer · 2016-11-17

5

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 21 hours ago

Seattle, WA, United States

Hi,

Iterating over the sequences should help. Another strategy is to process the BSgenomeViews object itself by chunk:

## A memory-efficient "letterFrequency" method for
## BSgenomeViews objects.
letterFrequency2 <- function(x, letters, OR="|",
                             as.prob=FALSE, ...)
{
    stopifnot(is(x, "BSgenomeViews"))
    chunksize <- 500000L
    chunks <- breakInChunks(length(x), chunksize)
    chunks <- as(chunks, "IRanges")
    ans_chunks <- lapply(seq_along(chunks),
        function(i) {
            x_chunk <- extractROWS(x, chunks[i])
            letterFrequency(x_chunk, letters, OR=OR,
                            as.prob=as.prob)
        })
    do.call(rbind, ans_chunks)
}

Then:

library(BSgenome.Mmusculus.UCSC.mm10)
windowViews <- Views(BSgenome.Mmusculus.UCSC.mm10, windowRanges)
gcFrequency <- letterFrequency2(windowViews, letters="GC", as.prob=TRUE)

At some point I should improve the current implementation of the "letterFrequency" method for BSgenomeViews objects along this idea.

Cheers,

H.

ADD COMMENT • link 9.2 years ago Hervé Pagès 16k

0

Entering edit mode

Thank you for the reply. One question, does creating a BSgenome views object not also load the entire genome into memory?

ADD REPLY • link 9.2 years ago jma1991 ▴ 70

1

Entering edit mode

Hi,

Creating the BSgenomeViews object doesn't load anything from disk. And displaying it only loads the content of the views that are being displayed.

Note that the above implementation of letterFrequency2 loads the content of 500000 views at a time. You can change this by modifying the value of chunksize. I'm thinking of making this a parameter that is stored in the BSgenomeViews object itself and that the user would specify at construction time. There would also be a chunksize getter and setter to let the user access the value and change it afterwards.