Search
Question: Implementing fast mathematical and statistical operations on genomic variables in bioconductor
2
gravatar for Devin King
9 weeks ago by
Devin King20
Devin King20 wrote:

The current paradigm to represent genomic variables (e.g. ChIP-seq coverage) in Bioconductor is as a named RleList or as a metadata column on a disjoint GRanges object. This can provide remarkable functionality and performance. However, how does one properly implement mathematical or statistical operations on these objects?

Consider the following (based on the genomicvars help page of GenomicRanges):

library(BSgenome.Scerevisiae.UCSC.sacCer3)
        BSgen <- BSgenome.Scerevisiae.UCSC.sacCer3
        # Function to make fake artificial genomic variables
              makeVars <- function(BSgen,n) {
                RleList(
                  lapply(seqlengths(Scerevisiae),
                         function(seqlen) {
                           tmp <- sample(n, seqlen, replace=TRUE)
                           Rle(cumsum(tmp - rev(tmp)))
                         })
                )
              }
        # Make genomic variables
              set.seed(17)
              x <- makeVars(BSgen,50)
              set.seed(300)
              y <- makeVars(BSgen,50)
              y[y<0] <- 1 # keep y positive 

We can use certain supported functionality for performing mathematical operations, e.g. finding parallel maxima using pmax:

newVar.bioc <- BiocGenerics::pmax(x,y) # works right away and is fast and awesome!

newVar.base <- base::pmax(x,y) # slow and unsatisfying but still works
identical(newVar.base,newVar.bioc)

However, what if I wanted to use other operations that aren't supported, e.g. ppois:

newVar.ppois <- stats::ppois(x,y)

This can be circumvented by manual coercion, but it defeats the purpose of using RleList

x.n <- lapply(x,as.numeric) # defeats the puropse of using RleList
y.n <- lapply(y,as.numeric)

ppois.xy <- RleList(lapply(seq_along(x.n), function(i){
  x.i <- x.n[[i]]
  y.i <- y.n[[i]]
  ppois.i <- ppois(x.i,lambda=y.i)
}))

My question, then, is what is the best practice in bioconductor for performing mathematical/statistical operations like ppois on genomicvars? Does this require extending R with c/c++? I'm a little unsure where to begin and any advice would be much appreciated.

Devin

ADD COMMENTlink modified 8 weeks ago by Hervé Pagès ♦♦ 13k • written 9 weeks ago by Devin King20
2
gravatar for Hervé Pagès
8 weeks ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:

Hi Devin,

Just a heads up that the plan is to move away from the named RleList representation or the disjoint GRanges with metadata columns for representing genomic variables. In a not too distant future genomic variables will be better represented as metadata columns on a GPos object.

Note that this is already possible for "small" genomes i.e. for genomes with less than 2^31 nucleotide positions. In order to support longer genomes (e.g. the Human genome) we first need to support long GPos and long Rle objects. Some progress was made a few months ago in the S4Vectors package towards this (long Rle objects are partially supported) but there is still a lot to do. Once we have this, a more natural/intuitive/convenient way to represent the coverage of a full genome will be with an Rle in the metadata column of a GPos object.

H.

ADD COMMENTlink modified 8 weeks ago • written 8 weeks ago by Hervé Pagès ♦♦ 13k

I like GPos and it would be great if it were applicable to human-size genomes. It would be nice to have easy way (simple function call) of moving between GPos and the RLE representation in a GRanges. Sometimes you want to operate on runs, other times positions.

ADD REPLYlink written 8 weeks ago by Michael Lawrence9.8k

Interesting! I am using functions operating on RleLists extensively (coverage, slice, Views, viewSums, etc), how will they be affected?

ADD REPLYlink written 7 weeks ago by maltethodberg40

I'm not sure yet. We'll need to come up with a plan to make the transition as smooth as possible. But we are not here yet.

H.

ADD REPLYlink written 7 weeks ago by Hervé Pagès ♦♦ 13k
1
gravatar for Michael Lawrence
9 weeks ago by
United States
Michael Lawrence9.8k wrote:

I am not sure whether C/C++ would really help unless there is some library for numerical computing with RLEs. The code you propose for ppois() is OK but I would use mapply() for cleanliness and coerce to the numeric vector within the loop in order to save memory (at a time cost).

We are working to push these types of abstractions down into the internals of R, so more numeric functions will gain support, but it will be a long time before all of the third party native code is ported.

ADD COMMENTlink written 9 weeks ago by Michael Lawrence9.8k
1

Map() / mapply() also retain names. mendoapply() returns the same object as input, avoiding explicit coercion to RleList.

ppoisRle <- function(q, lambda, ...) {
    q <- as.numeric(q)
    lambda <- as.numeric(lambda)
    value <- ppois(q = q, lambda = lambda, ...)
    Rle(value)
}
ppois.xy <- RleList(Map(ppois, x.n, y.n))
ppois.xy.2a <- RleList(Map(ppoisRle, x, y))
ppois.xy.2b <- mendoapply(ppoisRle, x, y)
identical(ppois.xy, ppois.xy.2a)           # TRUE
identical(ppois.xy.2a, ppois.xy.2b)        # TRUE

I don't think there's a meaningful time-difference between doing the coercion to / from integer in or outside the mapping, but obviously a space savings.

Using the same amount of memory as the original, a faster (about 2 vs. 3 seconds for me, 1.75s in ppois()) calculation is the unlist / relist trick (ok for smaller genomes!)

x.all <- as.numeric(unlist(unname(x)))
y.all <- as.numeric(unlist(unname(y)))
value.all <- ppois(x.all, y.all)
ppois.all <- relist(Rle(value.all), x)

 

 

ADD REPLYlink written 8 weeks ago by Martin Morgan ♦♦ 20k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 254 users visited in the last hour