Hello to everyone,

I'm a mathematician and I'm writing a dissertation about the use of different *gene set tests* for the *overrepresentation* of lists of genes coming from several experiments. I'm only considering three (but very different) procedures: *CAMERA*, *ROAST* and *ROMER*, all coming from the *limma* package.

For now I'm focusing on the ROAST procedure and reading the paper by Wu et al. (*ROAST: rotation gene set tests for complex microarray experiments*, 2010) I've learned how the gene set statistics `"mean"`

and `"msq"`

are defined (theoretically). For having some doubt about what was intended (in section 3.3 of the paper) for "weighted mean" (were the weights considered only in the numerator or in the denominator too?) in the `"mean50"`

statistic, I decided to look at the R code: the `roast.default`

function calls `.roastEffects`

, which performs the core tasks of the procedure, and inside of it are defined the test statistics. Here I've noticed that `"mean"`

and `"msq"`

are defined as follows:

> limma:::.roastEffects function (effects, gene.weights = NULL, set.statistic = "mean", var.prior, df.prior, var.post, nrot = 999, approx.zscore = TRUE) { ... switch(set.statistic, mean = { if (!is.null(gene.weights)) modt <- gene.weights * modt m <- mean(modt) statobs["down"] <- -m statobs["up"] <- m statobs["mixed"] <- mean(abs(modt)) ... }, msq = { modt2 <- modt^2 if (!is.null(gene.weights)) { modt2 <- abs(gene.weights) * modt2 modt <- gene.weights * modt } statobs["down"] <- sum(modt2[modt < 0])/nset statobs["up"] <- sum(modt2[modt > 0])/nset statobs["upordown"] <- max(statobs[c("down", "up")]) statobs["mixed"] <- mean(modt2) ... }

that is, they are a bit different from how they are defined in the paper: the gene weights are only in the numerator and A, the sum of the absolute values of the weights, is not defined. If we follow what is written in the paper, we would have:

... A <- sum(abs(gene.weights)) mean = { if (!is.null(gene.weights)) modt <- gene.weights * modt m <- sum(modt)/A statobs["down"] <- -m statobs["up"] <- m statobs["mixed"] <- sum(abs(modt))/A ... }, msq = { modt2 <- modt^2 if (!is.null(gene.weights)) { modt2 <- abs(gene.weights) * modt2 modt <- gene.weights * modt } statobs["down"] <- sum(modt2[modt < 0])/A statobs["up"] <- sum(modt2[modt > 0])/A statobs["upordown"] <- max(statobs[c("down", "up")]) statobs["mixed"] <- sum(modt2)/A ...

So my question is: what are the correct definitions? The code ones or the paper ones? Maybe this is a minor thing but I would like to be sure what to write in the dissertation. Thanks in advance for the answer.

P.S.: I'm using the most recent versions of both R and Bioconductor, as you can see from the `sessionInfo`

output:

> sessionInfo() R version 3.5.1 (2018-07-02) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 16.04.5 LTS Matrix products: default BLAS: /usr/lib/libblas/libblas.so.3.6.0 LAPACK: /usr/lib/lapack/liblapack.so.3.6.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] BiocInstaller_1.30.0 limma_3.36.2 loaded via a namespace (and not attached): [1] compiler_3.5.1 tools_3.5.1

Thanks for the reply mr. Smyth. I have some other (and I think last) questions about this topic: shouldn't the

`"floormean"`

statistic for themixed hypothesisbe defined as:and what's the meaning of the square-root of the median of the chi-squared distribution with 1 df? I can only think of it as a

"minimum positive threshold", but I can't understand why you chose that particular quantity and not another one.Yes, you are right,

`chimed`

should be square-rooted. Thanks for pointing it out.The reason for choosing this value is to make the mixed floormean statistics analogous to the directional floormean statistics. The directional floormean statstics are computed using

Here 0 is the median of the distribution of the null distribution of modt. When the null hypothesis is true, we expect the upper 50% of the modt values to be left as they are and the lower 50% of values to be reset to the median.

The mixed statistics are analogous. For them we use

where 0.67 is the median of the distribution of abs(modt) when modt is distributed as N(0,1). So again we leave the upper 50% of amodt values unchanged and the lower 50% are reset to the median. The value of 0.67 is chosen to make the mixed statistics as closely analogous to the directional statistics as possible.

Now I've got it, thanks again!