Do the gene set statistics of the ROAST paper correspond to those implemented in the .roastEffects R function?
1
0
Entering edit mode
@andrea-carenzo-15841
Last seen 3.7 years ago
Università degli Studi di Torino

Hello to everyone,

I'm a mathematician and I'm writing a dissertation about the use of different gene set tests for the overrepresentation of lists of genes coming from several experiments. I'm only considering three (but very different) procedures: CAMERA, ROAST and ROMER, all coming from the limma package.

For now I'm focusing on the ROAST procedure and reading the paper by Wu et al. (ROAST: rotation gene set tests for complex microarray experiments, 2010) I've learned how the gene set statistics "mean" and "msq" are defined (theoretically). For having some doubt about what was intended (in section 3.3 of the paper) for "weighted mean" (were the weights considered only in the numerator or in the denominator too?) in the "mean50" statistic, I decided to look at the R code: the roast.default function calls .roastEffects, which performs the core tasks of the procedure, and inside of it are defined the test statistics. Here I've noticed that "mean" and "msq" are defined as follows:

> limma:::.roastEffects
function (effects, gene.weights = NULL, set.statistic = "mean",
var.prior, df.prior, var.post, nrot = 999, approx.zscore = TRUE)
{

...

switch(set.statistic, mean = {
if (!is.null(gene.weights)) modt <- gene.weights * modt
m <- mean(modt)
statobs["down"] <- -m
statobs["up"] <- m
statobs["mixed"] <- mean(abs(modt))

...

}, msq = {
modt2 <- modt^2
if (!is.null(gene.weights)) {
modt2 <- abs(gene.weights) * modt2
modt <- gene.weights * modt
}
statobs["down"] <- sum(modt2[modt < 0])/nset
statobs["up"] <- sum(modt2[modt > 0])/nset
statobs["upordown"] <- max(statobs[c("down", "up")])
statobs["mixed"] <- mean(modt2)

...

}

that is, they are a bit different from how they are defined in the paper: the gene weights are only in the numerator and A, the sum of the absolute values of the weights, is not defined. If we follow what is written in the paper, we would have:

...
A <- sum(abs(gene.weights))
mean = {
if (!is.null(gene.weights)) modt <- gene.weights * modt
m <- sum(modt)/A
statobs["down"] <- -m
statobs["up"] <- m
statobs["mixed"] <- sum(abs(modt))/A

...

}, msq = {
modt2 <- modt^2
if (!is.null(gene.weights)) {
modt2 <- abs(gene.weights) * modt2
modt <- gene.weights * modt
}
statobs["down"] <- sum(modt2[modt < 0])/A
statobs["up"] <- sum(modt2[modt > 0])/A
statobs["upordown"] <- max(statobs[c("down", "up")])
statobs["mixed"] <- sum(modt2)/A

...


So my question is: what are the correct definitions? The code ones or the paper ones? Maybe this is a minor thing but I would like to be sure what to write in the dissertation. Thanks in advance for the answer.

P.S.: I'm using the most recent versions of both R and Bioconductor, as you can see from the sessionInfo output:

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8
 LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
 stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 BiocInstaller_1.30.0 limma_3.36.2

loaded via a namespace (and not attached):
 compiler_3.5.1 tools_3.5.1  
1
Entering edit mode
@gordon-smyth
Last seen 35 minutes ago
WEHI, Melbourne, Australia

The roast tests implemented in the limma package are the same as described in the article by Di Wu and myself in Bioinformatics.

It makes no difference whether the set statistics are divided by A or not, because the rotation p-values depend only on the rank order of the observed statistics relative to the rotated statistics. Dividing all the statistics by a constant factor makes no difference to the ranking or (hence) to the p-value.

The factor of A was included in the article for pedagogic value -- it shows that the statistic is equivalent to a weighted mean. The factor is omitted in the code though because it's computationally unnecessary.

0
Entering edit mode

Thanks for the reply mr. Smyth. I have some other (and I think last) questions about this topic: shouldn't the "floormean" statistic for the mixed hypothesis be defined as:

amodt <- pmax(abs(modt),sqrt(chimed))

and what's the meaning of the square-root of the median of the chi-squared distribution with 1 df? I can only think of it as a "minimum positive threshold", but I can't understand why you chose that particular quantity and not another one.

0
Entering edit mode

Yes, you are right, chimed should be square-rooted. Thanks for pointing it out.

The reason for choosing this value is to make the mixed floormean statistics analogous to the directional floormean statistics. The directional floormean statstics are computed using

pmax(modt, 0)

Here 0 is the median of the distribution of the null distribution of modt. When the null hypothesis is true, we expect the upper 50% of the modt values to be left as they are and the lower 50% of values to be reset to the median.

The mixed statistics are analogous. For them we use

pmax(abs(modt), 0.67)

where 0.67 is the median of the distribution of abs(modt) when modt is distributed as N(0,1). So again we leave the upper 50% of amodt values unchanged and the lower 50% are reset to the median. The value of 0.67 is chosen to make the mixed statistics as closely analogous to the directional statistics as possible.

0
Entering edit mode

Now I've got it, thanks again!