Question: Implementing fast mathematical and statistical operations on genomic variables in bioconductor
gravatar for Devin King
2.2 years ago by
Devin King20
Devin King20 wrote:

The current paradigm to represent genomic variables (e.g. ChIP-seq coverage) in Bioconductor is as a named RleList or as a metadata column on a disjoint GRanges object. This can provide remarkable functionality and performance. However, how does one properly implement mathematical or statistical operations on these objects?

Consider the following (based on the genomicvars help page of GenomicRanges):

        BSgen <- BSgenome.Scerevisiae.UCSC.sacCer3
        # Function to make fake artificial genomic variables
              makeVars <- function(BSgen,n) {
                         function(seqlen) {
                           tmp <- sample(n, seqlen, replace=TRUE)
                           Rle(cumsum(tmp - rev(tmp)))
        # Make genomic variables
              x <- makeVars(BSgen,50)
              y <- makeVars(BSgen,50)
              y[y<0] <- 1 # keep y positive 

We can use certain supported functionality for performing mathematical operations, e.g. finding parallel maxima using pmax:

newVar.bioc <- BiocGenerics::pmax(x,y) # works right away and is fast and awesome!

newVar.base <- base::pmax(x,y) # slow and unsatisfying but still works

However, what if I wanted to use other operations that aren't supported, e.g. ppois:

newVar.ppois <- stats::ppois(x,y)

This can be circumvented by manual coercion, but it defeats the purpose of using RleList

x.n <- lapply(x,as.numeric) # defeats the puropse of using RleList
y.n <- lapply(y,as.numeric)

ppois.xy <- RleList(lapply(seq_along(x.n), function(i){
  x.i <- x.n[[i]]
  y.i <- y.n[[i]]
  ppois.i <- ppois(x.i,lambda=y.i)

My question, then, is what is the best practice in bioconductor for performing mathematical/statistical operations like ppois on genomicvars? Does this require extending R with c/c++? I'm a little unsure where to begin and any advice would be much appreciated.


ADD COMMENTlink modified 2.2 years ago by Hervé Pagès ♦♦ 14k • written 2.2 years ago by Devin King20
Answer: Implementing fast mathematical and statistical operations on genomic variables i
gravatar for Hervé Pagès
2.2 years ago by
Hervé Pagès ♦♦ 14k
United States
Hervé Pagès ♦♦ 14k wrote:

Hi Devin,

Just a heads up that the plan is to move away from the named RleList representation or the disjoint GRanges with metadata columns for representing genomic variables. In a not too distant future genomic variables will be better represented as metadata columns on a GPos object.

Note that this is already possible for "small" genomes i.e. for genomes with less than 2^31 nucleotide positions. In order to support longer genomes (e.g. the Human genome) we first need to support long GPos and long Rle objects. Some progress was made a few months ago in the S4Vectors package towards this (long Rle objects are partially supported) but there is still a lot to do. Once we have this, a more natural/intuitive/convenient way to represent the coverage of a full genome will be with an Rle in the metadata column of a GPos object.


ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by Hervé Pagès ♦♦ 14k

I like GPos and it would be great if it were applicable to human-size genomes. It would be nice to have easy way (simple function call) of moving between GPos and the RLE representation in a GRanges. Sometimes you want to operate on runs, other times positions.

ADD REPLYlink written 2.2 years ago by Michael Lawrence11k

Interesting! I am using functions operating on RleLists extensively (coverage, slice, Views, viewSums, etc), how will they be affected?

ADD REPLYlink written 2.1 years ago by maltethodberg140

I'm not sure yet. We'll need to come up with a plan to make the transition as smooth as possible. But we are not here yet.


ADD REPLYlink written 2.1 years ago by Hervé Pagès ♦♦ 14k
Answer: Implementing fast mathematical and statistical operations on genomic variables i
gravatar for Michael Lawrence
2.2 years ago by
United States
Michael Lawrence11k wrote:

I am not sure whether C/C++ would really help unless there is some library for numerical computing with RLEs. The code you propose for ppois() is OK but I would use mapply() for cleanliness and coerce to the numeric vector within the loop in order to save memory (at a time cost).

We are working to push these types of abstractions down into the internals of R, so more numeric functions will gain support, but it will be a long time before all of the third party native code is ported.

ADD COMMENTlink written 2.2 years ago by Michael Lawrence11k

Map() / mapply() also retain names. mendoapply() returns the same object as input, avoiding explicit coercion to RleList.

ppoisRle <- function(q, lambda, ...) {
    q <- as.numeric(q)
    lambda <- as.numeric(lambda)
    value <- ppois(q = q, lambda = lambda, ...)
ppois.xy <- RleList(Map(ppois, x.n, y.n))
ppois.xy.2a <- RleList(Map(ppoisRle, x, y))
ppois.xy.2b <- mendoapply(ppoisRle, x, y)
identical(ppois.xy, ppois.xy.2a)           # TRUE
identical(ppois.xy.2a, ppois.xy.2b)        # TRUE

I don't think there's a meaningful time-difference between doing the coercion to / from integer in or outside the mapping, but obviously a space savings.

Using the same amount of memory as the original, a faster (about 2 vs. 3 seconds for me, 1.75s in ppois()) calculation is the unlist / relist trick (ok for smaller genomes!)

x.all <- as.numeric(unlist(unname(x)))
y.all <- as.numeric(unlist(unname(y)))
value.all <- ppois(x.all, y.all)
ppois.all <- relist(Rle(value.all), x)



ADD REPLYlink written 2.2 years ago by Martin Morgan ♦♦ 24k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 198 users visited in the last hour