**0**wrote:

Hello to everyone,

I'm a mathematician and I'm writing a dissertation about the use of different *gene set tests* for the *overrepresentation* of lists of genes coming from several experiments. I'm only considering three (but very different) procedures: *CAMERA*, *ROAST* and *ROMER*, all coming from the *limma* package.

For now I'm focusing on the ROAST procedure and reading the paper by Wu et al. (*ROAST: rotation gene set tests for complex microarray experiments*, 2010) I've learned how the gene set statistics `"mean"`

and `"msq"`

are defined (theoretically). For having some doubt about what was intended (in section 3.3 of the paper) for "weighted mean" (were the weights considered only in the numerator or in the denominator too?) in the `"mean50"`

statistic, I decided to look at the R code: the `roast.default`

function calls `.roastEffects`

, which performs the core tasks of the procedure, and inside of it are defined the test statistics. Here I've noticed that `"mean"`

and `"msq"`

are defined as follows:

> limma:::.roastEffects function (effects, gene.weights = NULL, set.statistic = "mean", var.prior, df.prior, var.post, nrot = 999, approx.zscore = TRUE) { ... switch(set.statistic, mean = { if (!is.null(gene.weights)) modt <- gene.weights * modt m <- mean(modt) statobs["down"] <- -m statobs["up"] <- m statobs["mixed"] <- mean(abs(modt)) ... }, msq = { modt2 <- modt^2 if (!is.null(gene.weights)) { modt2 <- abs(gene.weights) * modt2 modt <- gene.weights * modt } statobs["down"] <- sum(modt2[modt < 0])/nset statobs["up"] <- sum(modt2[modt > 0])/nset statobs["upordown"] <- max(statobs[c("down", "up")]) statobs["mixed"] <- mean(modt2) ... }

that is, they are a bit different from how they are defined in the paper: the gene weights are only in the numerator and A, the sum of the absolute values of the weights, is not defined. If we follow what is written in the paper, we would have:

... A <- sum(abs(gene.weights)) mean = { if (!is.null(gene.weights)) modt <- gene.weights * modt m <- sum(modt)/A statobs["down"] <- -m statobs["up"] <- m statobs["mixed"] <- sum(abs(modt))/A ... }, msq = { modt2 <- modt^2 if (!is.null(gene.weights)) { modt2 <- abs(gene.weights) * modt2 modt <- gene.weights * modt } statobs["down"] <- sum(modt2[modt < 0])/A statobs["up"] <- sum(modt2[modt > 0])/A statobs["upordown"] <- max(statobs[c("down", "up")]) statobs["mixed"] <- sum(modt2)/A ...

So my question is: what are the correct definitions? The code ones or the paper ones? Maybe this is a minor thing but I would like to be sure what to write in the dissertation. Thanks in advance for the answer.

P.S.: I'm using the most recent versions of both R and Bioconductor, as you can see from the `sessionInfo`

output:

> sessionInfo() R version 3.5.1 (2018-07-02) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 16.04.5 LTS Matrix products: default BLAS: /usr/lib/libblas/libblas.so.3.6.0 LAPACK: /usr/lib/lapack/liblapack.so.3.6.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] BiocInstaller_1.30.0 limma_3.36.2 loaded via a namespace (and not attached): [1] compiler_3.5.1 tools_3.5.1