Implementing fast mathematical and statistical operations on genomic variables in bioconductor
2
2
Entering edit mode
Devin King ▴ 20
@devin-king-13481
Last seen 6.6 years ago

The current paradigm to represent genomic variables (e.g. ChIP-seq coverage) in Bioconductor is as a named RleList or as a metadata column on a disjoint GRanges object. This can provide remarkable functionality and performance. However, how does one properly implement mathematical or statistical operations on these objects?

Consider the following (based on the genomicvars help page of GenomicRanges):

library(BSgenome.Scerevisiae.UCSC.sacCer3)
        BSgen <- BSgenome.Scerevisiae.UCSC.sacCer3
        # Function to make fake artificial genomic variables
              makeVars <- function(BSgen,n) {
                RleList(
                  lapply(seqlengths(Scerevisiae),
                         function(seqlen) {
                           tmp <- sample(n, seqlen, replace=TRUE)
                           Rle(cumsum(tmp - rev(tmp)))
                         })
                )
              }
        # Make genomic variables
              set.seed(17)
              x <- makeVars(BSgen,50)
              set.seed(300)
              y <- makeVars(BSgen,50)
              y[y<0] <- 1 # keep y positive 

We can use certain supported functionality for performing mathematical operations, e.g. finding parallel maxima using pmax:

newVar.bioc <- BiocGenerics::pmax(x,y) # works right away and is fast and awesome!

newVar.base <- base::pmax(x,y) # slow and unsatisfying but still works
identical(newVar.base,newVar.bioc)

However, what if I wanted to use other operations that aren't supported, e.g. ppois:

newVar.ppois <- stats::ppois(x,y)

This can be circumvented by manual coercion, but it defeats the purpose of using RleList

x.n <- lapply(x,as.numeric) # defeats the puropse of using RleList
y.n <- lapply(y,as.numeric)

ppois.xy <- RleList(lapply(seq_along(x.n), function(i){
  x.i <- x.n[[i]]
  y.i <- y.n[[i]]
  ppois.i <- ppois(x.i,lambda=y.i)
}))

My question, then, is what is the best practice in bioconductor for performing mathematical/statistical operations like ppois on genomicvars? Does this require extending R with c/c++? I'm a little unsure where to begin and any advice would be much appreciated.

Devin

genomicranges biocgenerics rlelist genomicvars iranges • 1.8k views
ADD COMMENT
2
Entering edit mode
@herve-pages-1542
Last seen 2 days ago
Seattle, WA, United States

Hi Devin,

Just a heads up that the plan is to move away from the named RleList representation or the disjoint GRanges with metadata columns for representing genomic variables. In a not too distant future genomic variables will be better represented as metadata columns on a GPos object.

Note that this is already possible for "small" genomes i.e. for genomes with less than 2^31 nucleotide positions. In order to support longer genomes (e.g. the Human genome) we first need to support long GPos and long Rle objects. Some progress was made a few months ago in the S4Vectors package towards this (long Rle objects are partially supported) but there is still a lot to do. Once we have this, a more natural/intuitive/convenient way to represent the coverage of a full genome will be with an Rle in the metadata column of a GPos object.

H.

ADD COMMENT
0
Entering edit mode

I like GPos and it would be great if it were applicable to human-size genomes. It would be nice to have easy way (simple function call) of moving between GPos and the RLE representation in a GRanges. Sometimes you want to operate on runs, other times positions.

ADD REPLY
0
Entering edit mode

Interesting! I am using functions operating on RleLists extensively (coverage, slice, Views, viewSums, etc), how will they be affected?

ADD REPLY
0
Entering edit mode

I'm not sure yet. We'll need to come up with a plan to make the transition as smooth as possible. But we are not here yet.

H.

ADD REPLY
1
Entering edit mode
@michael-lawrence-3846
Last seen 3.0 years ago
United States

I am not sure whether C/C++ would really help unless there is some library for numerical computing with RLEs. The code you propose for ppois() is OK but I would use mapply() for cleanliness and coerce to the numeric vector within the loop in order to save memory (at a time cost).

We are working to push these types of abstractions down into the internals of R, so more numeric functions will gain support, but it will be a long time before all of the third party native code is ported.

ADD COMMENT
1
Entering edit mode

Map() / mapply() also retain names. mendoapply() returns the same object as input, avoiding explicit coercion to RleList.

ppoisRle <- function(q, lambda, ...) {
    q <- as.numeric(q)
    lambda <- as.numeric(lambda)
    value <- ppois(q = q, lambda = lambda, ...)
    Rle(value)
}
ppois.xy <- RleList(Map(ppois, x.n, y.n))
ppois.xy.2a <- RleList(Map(ppoisRle, x, y))
ppois.xy.2b <- mendoapply(ppoisRle, x, y)
identical(ppois.xy, ppois.xy.2a)           # TRUE
identical(ppois.xy.2a, ppois.xy.2b)        # TRUE

I don't think there's a meaningful time-difference between doing the coercion to / from integer in or outside the mapping, but obviously a space savings.

Using the same amount of memory as the original, a faster (about 2 vs. 3 seconds for me, 1.75s in ppois()) calculation is the unlist / relist trick (ok for smaller genomes!)

x.all <- as.numeric(unlist(unname(x)))
y.all <- as.numeric(unlist(unname(y)))
value.all <- ppois(x.all, y.all)
ppois.all <- relist(Rle(value.all), x)

 

 

ADD REPLY

Login before adding your answer.

Traffic: 614 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6