Search
Question: Implementing fast mathematical and statistical operations on genomic variables in bioconductor
2
15 months ago by
Devin King20
Devin King20 wrote:

The current paradigm to represent genomic variables (e.g. ChIP-seq coverage) in Bioconductor is as a named RleList or as a metadata column on a disjoint GRanges object. This can provide remarkable functionality and performance. However, how does one properly implement mathematical or statistical operations on these objects?

Consider the following (based on the genomicvars help page of GenomicRanges):

library(BSgenome.Scerevisiae.UCSC.sacCer3)
BSgen <- BSgenome.Scerevisiae.UCSC.sacCer3
# Function to make fake artificial genomic variables
makeVars <- function(BSgen,n) {
RleList(
lapply(seqlengths(Scerevisiae),
function(seqlen) {
tmp <- sample(n, seqlen, replace=TRUE)
Rle(cumsum(tmp - rev(tmp)))
})
)
}
# Make genomic variables
set.seed(17)
x <- makeVars(BSgen,50)
set.seed(300)
y <- makeVars(BSgen,50)
y[y<0] <- 1 # keep y positive 

We can use certain supported functionality for performing mathematical operations, e.g. finding parallel maxima using pmax:

newVar.bioc <- BiocGenerics::pmax(x,y) # works right away and is fast and awesome!

newVar.base <- base::pmax(x,y) # slow and unsatisfying but still works
identical(newVar.base,newVar.bioc)

However, what if I wanted to use other operations that aren't supported, e.g. ppois:

newVar.ppois <- stats::ppois(x,y)

This can be circumvented by manual coercion, but it defeats the purpose of using RleList

x.n <- lapply(x,as.numeric) # defeats the puropse of using RleList
y.n <- lapply(y,as.numeric)

ppois.xy <- RleList(lapply(seq_along(x.n), function(i){
x.i <- x.n[[i]]
y.i <- y.n[[i]]
ppois.i <- ppois(x.i,lambda=y.i)
}))

My question, then, is what is the best practice in bioconductor for performing mathematical/statistical operations like ppois on genomicvars? Does this require extending R with c/c++? I'm a little unsure where to begin and any advice would be much appreciated.

Devin

modified 14 months ago by Hervé Pagès ♦♦ 13k • written 15 months ago by Devin King20
2
14 months ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:

Hi Devin,

Just a heads up that the plan is to move away from the named RleList representation or the disjoint GRanges with metadata columns for representing genomic variables. In a not too distant future genomic variables will be better represented as metadata columns on a GPos object.

Note that this is already possible for "small" genomes i.e. for genomes with less than 2^31 nucleotide positions. In order to support longer genomes (e.g. the Human genome) we first need to support long GPos and long Rle objects. Some progress was made a few months ago in the S4Vectors package towards this (long Rle objects are partially supported) but there is still a lot to do. Once we have this, a more natural/intuitive/convenient way to represent the coverage of a full genome will be with an Rle in the metadata column of a GPos object.

H.

I like GPos and it would be great if it were applicable to human-size genomes. It would be nice to have easy way (simple function call) of moving between GPos and the RLE representation in a GRanges. Sometimes you want to operate on runs, other times positions.

Interesting! I am using functions operating on RleLists extensively (coverage, slice, Views, viewSums, etc), how will they be affected?

I'm not sure yet. We'll need to come up with a plan to make the transition as smooth as possible. But we are not here yet.

H.

1
15 months ago by
United States
Michael Lawrence10k wrote:

I am not sure whether C/C++ would really help unless there is some library for numerical computing with RLEs. The code you propose for ppois() is OK but I would use mapply() for cleanliness and coerce to the numeric vector within the loop in order to save memory (at a time cost).

We are working to push these types of abstractions down into the internals of R, so more numeric functions will gain support, but it will be a long time before all of the third party native code is ported.

1

Map() / mapply() also retain names. mendoapply() returns the same object as input, avoiding explicit coercion to RleList.

ppoisRle <- function(q, lambda, ...) {
q <- as.numeric(q)
lambda <- as.numeric(lambda)
value <- ppois(q = q, lambda = lambda, ...)
Rle(value)
}
ppois.xy <- RleList(Map(ppois, x.n, y.n))
ppois.xy.2a <- RleList(Map(ppoisRle, x, y))
ppois.xy.2b <- mendoapply(ppoisRle, x, y)
identical(ppois.xy, ppois.xy.2a)           # TRUE
identical(ppois.xy.2a, ppois.xy.2b)        # TRUE


I don't think there's a meaningful time-difference between doing the coercion to / from integer in or outside the mapping, but obviously a space savings.

Using the same amount of memory as the original, a faster (about 2 vs. 3 seconds for me, 1.75s in ppois()) calculation is the unlist / relist trick (ok for smaller genomes!)

x.all <- as.numeric(unlist(unname(x)))
y.all <- as.numeric(unlist(unname(y)))
value.all <- ppois(x.all, y.all)
ppois.all <- relist(Rle(value.all), x)