The support.bioconductor.org editor has been updated to markdown! Please see more info at: Tutorial: Updated Support Site Editor

Question: Do the gene set statistics of the ROAST paper correspond to those implemented in the .roastEffects R function?
0
gravatar for Andrea Carenzo
5 months ago by
Università degli Studi di Torino
Andrea Carenzo0 wrote:

Hello to everyone,

I'm a mathematician and I'm writing a dissertation about the use of different gene set tests for the overrepresentation of lists of genes coming from several experiments. I'm only considering three (but very different) procedures: CAMERA, ROAST and ROMER, all coming from the limma package.

For now I'm focusing on the ROAST procedure and reading the paper by Wu et al. (ROAST: rotation gene set tests for complex microarray experiments, 2010) I've learned how the gene set statistics "mean" and "msq" are defined (theoretically). For having some doubt about what was intended (in section 3.3 of the paper) for "weighted mean" (were the weights considered only in the numerator or in the denominator too?) in the "mean50" statistic, I decided to look at the R code: the roast.default function calls .roastEffects, which performs the core tasks of the procedure, and inside of it are defined the test statistics. Here I've noticed that "mean" and "msq" are defined as follows:

> limma:::.roastEffects
function (effects, gene.weights = NULL, set.statistic = "mean",
    var.prior, df.prior, var.post, nrot = 999, approx.zscore = TRUE)
{

...

switch(set.statistic, mean = {
        if (!is.null(gene.weights)) modt <- gene.weights * modt
        m <- mean(modt)
        statobs["down"] <- -m
        statobs["up"] <- m
        statobs["mixed"] <- mean(abs(modt))

...

}, msq = {
        modt2 <- modt^2
        if (!is.null(gene.weights)) {
            modt2 <- abs(gene.weights) * modt2
            modt <- gene.weights * modt
        }
        statobs["down"] <- sum(modt2[modt < 0])/nset
        statobs["up"] <- sum(modt2[modt > 0])/nset
        statobs["upordown"] <- max(statobs[c("down", "up")])
        statobs["mixed"] <- mean(modt2)

...

}

that is, they are a bit different from how they are defined in the paper: the gene weights are only in the numerator and A, the sum of the absolute values of the weights, is not defined. If we follow what is written in the paper, we would have:

...
A <- sum(abs(gene.weights))
mean = {
        if (!is.null(gene.weights)) modt <- gene.weights * modt
        m <- sum(modt)/A
        statobs["down"] <- -m
        statobs["up"] <- m
        statobs["mixed"] <- sum(abs(modt))/A

...

}, msq = {
        modt2 <- modt^2
        if (!is.null(gene.weights)) {
            modt2 <- abs(gene.weights) * modt2
            modt <- gene.weights * modt
        }
        statobs["down"] <- sum(modt2[modt < 0])/A
        statobs["up"] <- sum(modt2[modt > 0])/A
        statobs["upordown"] <- max(statobs[c("down", "up")])
        statobs["mixed"] <- sum(modt2)/A

...

So my question is: what are the correct definitions? The code ones or the paper ones? Maybe this is a minor thing but I would like to be sure what to write in the dissertation. Thanks in advance for the answer.

P.S.: I'm using the most recent versions of both R and Bioconductor, as you can see from the sessionInfo output:

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] BiocInstaller_1.30.0 limma_3.36.2        

loaded via a namespace (and not attached):
[1] compiler_3.5.1 tools_3.5.1  
ADD COMMENTlink modified 5 months ago • written 5 months ago by Andrea Carenzo0
Answer: Does gene set statistics of the ROAST paper correspond to those implemented in t
1
gravatar for Gordon Smyth
5 months ago by
Gordon Smyth36k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth36k wrote:

The roast tests implemented in the limma package are the same as described in the article by Di Wu and myself in Bioinformatics.

It makes no difference whether the set statistics are divided by A or not, because the rotation p-values depend only on the rank order of the observed statistics relative to the rotated statistics. Dividing all the statistics by a constant factor makes no difference to the ranking or (hence) to the p-value.

The factor of A was included in the article for pedagogic value -- it shows that the statistic is equivalent to a weighted mean. The factor is omitted in the code though because it's computationally unnecessary.

ADD COMMENTlink modified 5 months ago • written 5 months ago by Gordon Smyth36k

Thanks for the reply mr. Smyth. I have some other (and I think last) questions about this topic: shouldn't the "floormean" statistic for the mixed hypothesis be defined as:

amodt <- pmax(abs(modt),sqrt(chimed))

and what's the meaning of the square-root of the median of the chi-squared distribution with 1 df? I can only think of it as a "minimum positive threshold", but I can't understand why you chose that particular quantity and not another one.

 

ADD REPLYlink modified 5 months ago • written 5 months ago by Andrea Carenzo0

Yes, you are right, chimed should be square-rooted. Thanks for pointing it out.

The reason for choosing this value is to make the mixed floormean statistics analogous to the directional floormean statistics. The directional floormean statstics are computed using

pmax(modt, 0)

Here 0 is the median of the distribution of the null distribution of modt. When the null hypothesis is true, we expect the upper 50% of the modt values to be left as they are and the lower 50% of values to be reset to the median.

The mixed statistics are analogous. For them we use

pmax(abs(modt), 0.67)

where 0.67 is the median of the distribution of abs(modt) when modt is distributed as N(0,1). So again we leave the upper 50% of amodt values unchanged and the lower 50% are reset to the median. The value of 0.67 is chosen to make the mixed statistics as closely analogous to the directional statistics as possible.

ADD REPLYlink modified 5 months ago • written 5 months ago by Gordon Smyth36k

Now I've got it, thanks again!

ADD REPLYlink written 5 months ago by Andrea Carenzo0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 189 users visited in the last hour