Question

Limma Singlet Moderated T-Test

0

Entering edit mode

andrew.j.skelton73 ▴ 370

@andrewjskelton73-7074

Last seen 12 months ago

United Kingdom

Hi,

I only noticed this by chance, but it peaked my curiosity and couldn't find any documentation about it. I occasionally get asked about experiments where conditions are in singlet (no replication), and my typical mandate is to run the line of "no replication, no statistics", however in the case of making the most of data available, I was aware of DESeq2's single sample method in which all samples are treated as a single group for dispersion estimation. When running the same singlet design in Limma, I expected an error when testing the model fit, but I got nothing.

So my question is, what's Limma doing in the case of a singlet experiment A-B, as my assumption was that the moderated t-test couldn't be calculated without replication?

Some code just to make a reproducible example:

library(limma)
set.seed(1234)
pheno_table            <- data.frame(Rep      = c(paste0("A",1:8)), 
                                     Variable = c("A","B","C","C","C","D","D","D"))
mat_rows               <- 100
matrix_in              <- matrix(log2(rnorm(n = (mat_rows*nrow(pheno_table)), mean = 100, sd = 30)), nrow = mat_rows)
design                 <- model.matrix(~0 + Variable, data = pheno_table)
colnames(design)       <- gsub("Variable", "", colnames(design))
contrasts              <- c("A_Vs_B" = "B-A")
contrast_mat           <- makeContrasts(contrasts = contrasts, levels = colnames(design))
colnames(contrast_mat) <- names(contrasts)
fit                    <- eBayes(contrasts.fit(lmFit(matrix_in, design),contrast_mat))
topTable(fit, coef = 1, number = 5, p.value = 0.1)

#      logFC  AveExpr        t      P.Value  adj.P.Val         B
# 92 -2.909594 6.21836 -4.935319 0.0009739746 0.09739746 -0.3183399

limma • 1.5k views

ADD COMMENT • link updated 7.6 years ago by Gordon Smyth 52k • written 7.8 years ago by andrew.j.skelton73 ▴ 370

0

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 18 hours ago

WEHI, Melbourne, Australia

But you do have replication! Your experiment has 3 replicates in group C and 3 replicates in group D. Your design thereby provides 4 residual degrees of freedom from which the residual variance can be estimated and limma, like any good linear modelling software, makes use of that.

This is no different from the lm() function in the stats package. Here is a little simulated example with lm() applied to your design:

> y <- rnorm(8)
> Variable = factor(c("A","B","C","C","C","D","D","D"))
> fit <- lm(y ~ Variable)
> anova(fit)
Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value Pr(>F)
Variable   3 1.9051 0.63503  0.5263 0.6875
Residuals  4 4.8265 1.20662

Notice that Df=4 for the residuals and the residual variance is estimated as 1.20662. Now let's use that to compute t-tests:

> summary(fit)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.7275     1.0985   0.662    0.544
VariableB    -0.9125     1.5535  -0.587    0.588
VariableC    -1.4630     1.2684  -1.153    0.313
VariableD    -1.4415     1.2684  -1.136    0.319

Residual standard error: 1.098 on 4 degrees of freedom
Multiple R-squared:  0.283,     Adjusted R-squared:  -0.2547 
F-statistic: 0.5263 on 3 and 4 DF,  p-value: 0.6875

This is all quite standard. No need for any warning. limma does the same base computations as lm() but adds in Bayesian moderation.

On the other hand, if you really did have no replication, then limma would tell you:

> library(limma)
> matrix_in <- matrix(rnorm(400),100,4)
> Variable <- factor(c("A","B","C","D"))
> design <- model.matrix(~Variable)
> fit <- lmFit(matrix_in, design)
> fit <- eBayes(fit)
Error in ebayes(fit = fit, proportion = proportion, stdev.coef.lim = stdev.coef.lim,  : 
  No residual degrees of freedom in linear model fits

Note that the same general principles apply to the edgeR and DESeq2 packages as well, and neither of them would have any difficulties with your experimental design either. Section 2.11 of the edgeR User's Guide refers to designs that have no replication at all, not to an experiment such as yours. The same goes for DESeq2's single sample method -- it is only for experiments with no replicates at all.

ADD COMMENT • link 7.6 years ago Gordon Smyth 52k

score 3 · Accepted Answer · 2017-07-02

In your case, you're still estimating the sample variance from groups C and D. limma will use all samples to estimate the variance, even if the comparisons only involve a subset of samples. There are non-zero residual degrees of freedom in this model, so there is no need to resort to non-standard methods for variance estimation.

In particular, the "single-sample" approach used by DESeq2 is not without problems. Let's consider the application of this strategy in the context of a limma analysis. Sure, you could get an estimate of a common variance if you assume that most genes are not DE between conditions. (The same arguments apply to estimation of a mean-variance trend; I will stick with a common estimate for simplicity.) However, using this estimate in a moderated t-test would be equivalent to having infinite prior degrees of freedom, i.e., the true variance for each gene is the same. I have never seen an RNA-seq data set where this is the case - the prior d.f. is generally low, around 5-20, reflecting the differences in the variability between genes.

Overstating the prior d.f. is problematic as it suggests strong homogeneity in the true variances where there is none. This usually results in anticonservativeness, where you get more false positives than expected due to unmodelled variability in the variance estimates. The same concern applies to count-based models as well, which is why the edgeR user's guide is reluctant to suggest this strategy as option 3 in Section 2.11. But for your experiment, none of this matters: limma can directly estimate the variance as part of a standard analysis.