limma moderated t-statistics and B-statistics

Gordon Smyth 53k

@gordon-smyth

WEHI, Melbourne, Australia

This is to respond to a number of questions about the interpretation of the moderated t and B-statistics in limma. This will be a section of the Limma User's Guide in the next release.

Gordon
----------------------------------

Statistics for Differential Expression

A number of summary statistics are computed by the eBayes() function for each gene and each contrast. The M-value (M) is the log2-fold change, or sometimes the log2-expression level, for that gene. The A-value (A) is the average expression level for that gene across all the arrays and channels. The moderated t-statistic (t) is the ratio of the M-value to its standard error. This has the same interpretation as an ordinary t-statistic except that the standard errors have been moderated across genes, effectively borrowing information from the ensemble of genes to aid with inference about each individual gene. The ordinary t-statistics are not usually required or recommended, but they can be recovered by

> tstat.ord <- fit$coef / fit$stdev.unscaled / fit$sigma

after fitting a linear model. The ordinary t-statistic p-values can be recovered by

> tstat.ord.p.value <- 2*pt( abs(tstat.ord), df=fit$df.residual, lower.tail=FALSE)

The ordinary t-statistic is on fit$df.residual degrees of freedom while the moderated t-statistic is on fit$df.residual+fit$df.prior degrees of freedom.
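As a rough sketch of the relationship between the two statistics, the following recomputes both by hand on simulated data. The data, the design matrix and the variable names are made up for illustration. The components fit$s2.post and fit$df.total are not mentioned above; they are assumed here to hold the moderated variances and the degrees of freedom that eBayes() actually uses for the moderated t.

```r
library(limma)

set.seed(1)
ngenes <- 200
# Simulated log-expression values with gene-specific standard deviations
# (illustrative data only, not from a real experiment)
sd <- 0.3 * sqrt(4 / rchisq(ngenes, df = 4))
y <- matrix(rnorm(ngenes * 6, sd = sd), nrow = ngenes, ncol = 6)

# Two groups of three arrays each
design <- cbind(Intercept = 1, Group = c(0, 0, 0, 1, 1, 1))

fit <- lmFit(y, design)
fit <- eBayes(fit)

# Ordinary t-statistics and p-values, as in the text above
tstat.ord <- fit$coefficients / fit$stdev.unscaled / fit$sigma
tstat.ord.p.value <- 2 * pt(abs(tstat.ord), df = fit$df.residual, lower.tail = FALSE)

# Moderated t-statistics: same numerator, but the gene-wise variance
# is replaced by the moderated (posterior) variance
tstat.mod <- fit$coefficients / fit$stdev.unscaled / sqrt(fit$s2.post)
tstat.mod.p.value <- 2 * pt(abs(tstat.mod), df = fit$df.total, lower.tail = FALSE)

# The hand-computed moderated values should agree with what eBayes() stored
all.equal(tstat.mod, fit$t, check.attributes = FALSE)
all.equal(tstat.mod.p.value, fit$p.value, check.attributes = FALSE)
```

In practice there is no need to recompute the moderated values, since fit$t and fit$p.value already contain them. The point of the sketch is only that the two statistics differ in the variance estimate and the degrees of freedom.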

The p-value (p-value) is obtained from the moderated t-statistic, usually after some form of adjustment for multiple testing. The most popular form of adjustment is "fdr", which is Benjamini and Hochberg's method to control the false discovery rate. The meaning of the adjusted p-value is as follows. If you select all genes with adjusted p-value below a given value, say 0.05, as differentially expressed, then the expected proportion of false discoveries in the selected group should be less than that value, in this case less than 5%.
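The adjustment itself can be illustrated with the base R function p.adjust(), where "fdr" is an alias for "BH". The p-values below are hypothetical and chosen only to make the arithmetic easy to follow.

```r
# Hypothetical unadjusted p-values for six genes
p <- c(0.0002, 0.003, 0.012, 0.04, 0.2, 0.6)

# Benjamini and Hochberg adjustment; "fdr" and "BH" give the same result
adj.p <- p.adjust(p, method = "BH")

# Genes selected at a false discovery rate of 5%
selected <- which(adj.p < 0.05)
```

Note that the fourth gene has an unadjusted p-value below 0.05 but is no longer selected after adjustment.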

The B-statistic (lods or B) is the log-odds that the gene is differentially expressed. Suppose for example that B=1.5. The odds of differential expression are exp(1.5)=4.48, i.e., about four and a half to one. The probability that the gene is differentially expressed is 4.48/(1+4.48)=0.82, i.e., the probability is about 82% that this gene is differentially expressed. A B-statistic of zero corresponds to a 50-50 chance that the gene is differentially expressed. The B-statistic is automatically adjusted for multiple testing by assuming that 1% of the genes, or some other percentage specified by the user, are expected to be differentially expressed. If there are no missing values in your data, then the moderated t and B statistics will rank the genes in exactly the same order. Even if you do have spot weights or missing data, the p-values and B-statistics will usually provide a very similar ranking of the genes.
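The conversion from a B-statistic to a probability in the worked example above can be written in a few lines of base R. The variable names are just for illustration.

```r
B <- 1.5

# Log-odds to odds
odds <- exp(B)

# Odds to probability; equivalent to plogis(B)
prob <- odds / (1 + odds)

# A B-statistic of zero corresponds to a probability of one half
prob.zero <- plogis(0)
```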

Please keep in mind that the moderated t-statistic p-values and the B-statistic probabilities depend on various sorts of mathematical assumptions which are never exactly true for microarray data. The B-statistics also depend on a prior guess for the proportion of differentially expressed genes. Therefore they are intended to be taken as a guide rather than as a strict measure of the probability of differential expression. Of the three statistics, the moderated-t, the associated p-value and the B-statistic, we usually base our gene selections on the p-value. All three measures are closely related, but the moderated-t and its p-value do not require a prior guess for the number of differentially expressed genes.

The above-mentioned statistics are computed for every contrast for each gene. The eBayes() function computes one more useful statistic. The moderated F-statistic (F) combines the t-statistics for all the contrasts for each gene into an overall test of significance for that gene. The moderated F-statistic tests whether any of the contrasts are non-zero for that gene, i.e., whether that gene is differentially expressed on any contrast. The moderated-F has numerator degrees of freedom equal to the number of contrasts and denominator degrees of freedom the same as those of the moderated-t. Its p-value is stored as fit$F.p.value. It is similar to the ordinary F-statistic from analysis of variance except that the denominator mean squares are moderated across genes.
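To make the degrees of freedom concrete, here is a minimal sketch of how an F-statistic with those numerator and denominator degrees of freedom maps to a p-value using the base R function pf(). The numbers are hypothetical, and this illustrates the general calculation rather than limma's exact internal code.

```r
# Hypothetical moderated F-statistic for one gene
Fstat <- 5.1

# Numerator df: number of contrasts (assumed to be 3 here)
df1 <- 3

# Denominator df: same as for the moderated t (assumed to be 14 here)
df2 <- 14

# Upper-tail probability of the F distribution
F.p.value <- pf(Fstat, df1 = df1, df2 = df2, lower.tail = FALSE)
```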

In complex experiments with many closely related contrasts, it may sometimes be desirable to adjust p-values across contrasts as well as across genes. The function decideTests() provides several methods to do this (see the Section "Multiple Testing Across Contrasts" in the limma User's Guide).

Microarray limma • 46k views

21.4 years ago • updated 5.3 years ago Gordon Smyth 53k