Search
Question: Limma Singlet Moderated T-Test
0
10 months ago by
United Kingdom
andrew.j.skelton73310 wrote:

Hi,

I only noticed this by chance, but it peaked my curiosity and couldn't find any documentation about it. I occasionally get asked about experiments where conditions are in singlet (no replication), and my typical mandate is to run the line of "no replication, no statistics", however in the case of making the most of data available, I was aware of DESeq2's single sample method in which all samples are treated as a single group for dispersion estimation. When running the same singlet design in Limma, I expected an error when testing the model fit, but I got nothing.

So my question is, what's Limma doing in the case of a singlet experiment A-B, as my assumption was that the moderated t-test couldn't be calculated without replication?

Some code just to make a reproducible example:

library(limma)
set.seed(1234)
pheno_table            <- data.frame(Rep      = c(paste0("A",1:8)),
Variable = c("A","B","C","C","C","D","D","D"))
mat_rows               <- 100
matrix_in              <- matrix(log2(rnorm(n = (mat_rows*nrow(pheno_table)), mean = 100, sd = 30)), nrow = mat_rows)
design                 <- model.matrix(~0 + Variable, data = pheno_table)
colnames(design)       <- gsub("Variable", "", colnames(design))
contrasts              <- c("A_Vs_B" = "B-A")
contrast_mat           <- makeContrasts(contrasts = contrasts, levels = colnames(design))
colnames(contrast_mat) <- names(contrasts)
fit                    <- eBayes(contrasts.fit(lmFit(matrix_in, design),contrast_mat))
topTable(fit, coef = 1, number = 5, p.value = 0.1)

#      logFC  AveExpr        t      P.Value  adj.P.Val         B
# 92 -2.909594 6.21836 -4.935319 0.0009739746 0.09739746 -0.3183399
modified 8 months ago by Gordon Smyth33k • written 10 months ago by andrew.j.skelton73310
3
10 months ago by
Aaron Lun19k
Cambridge, United Kingdom
Aaron Lun19k wrote:

In your case, you're still estimating the sample variance from groups C and D. limma will use all samples to estimate the variance, even if the comparisons only involve a subset of samples. There are non-zero residual degrees of freedom in this model, so there is no need to resort to non-standard methods for variance estimation.

In particular, the "single-sample" approach used by DESeq2 is not without problems. Let's consider the application of this strategy in the context of a limma analysis. Sure, you could get an estimate of a common variance if you assume that most genes are not DE between conditions. (The same arguments apply to estimation of a mean-variance trend; I will stick with a common estimate for simplicity.) However, using this estimate in a moderated t-test would be equivalent to having infinite prior degrees of freedom, i.e., the true variance for each gene is the same. I have never seen an RNA-seq data set where this is the case - the prior d.f. is generally low, around 5-20, reflecting the differences in the variability between genes.

Overstating the prior d.f. is problematic as it suggests strong homogeneity in the true variances where there is none. This usually results in anticonservativeness, where you get more false positives than expected due to unmodelled variability in the variance estimates. The same concern applies to count-based models as well, which is why the edgeR user's guide is reluctant to suggest this strategy as option 3 in Section 2.11. But for your experiment, none of this matters: limma can directly estimate the variance as part of a standard analysis.

Thanks for suggesting the edgeR manual, it's nice to have a place where it's written down that I can point people to. I think it'd be useful for Limma to possibly output a warning where singlet comparisons are made stipulating a summary of what you've said. Thanks though for the comprehensive response!

Singlet comparisons (n=1 in some groups but n>1 in others) are quite routine and I don't see any reason for a warning from limma. The complications that Aaron discussed above relate to DESeq2's single sample method, which is quite a different thing, and he was explaining why it isn't used in limma. By contrast, limma's method for your experiment is standard linear modelling and does not require any such compromises.

DESeq2's single sample method and edgeR's no replicate methods are really just red herrings here. Your experiment does have replication and none of the packages (limma, edgeR or DESeq2) would need to resort to a "no replicate" method for your design.

0
8 months ago by
Gordon Smyth33k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth33k wrote:

But you do have replication! Your experiment has 3 replicates in group C and 3 replicates in group D. Your design thereby provides 4 residual degrees of freedom from which the residual variance can be estimated and limma, like any good linear modelling software, makes use of that.

This is no different from the lm() function in the stats package. Here is a little simulated example with lm() applied to your design:

> y <- rnorm(8)
> Variable = factor(c("A","B","C","C","C","D","D","D"))
> fit <- lm(y ~ Variable)
> anova(fit)
Analysis of Variance Table

Response: y
Df Sum Sq Mean Sq F value Pr(>F)
Variable   3 1.9051 0.63503  0.5263 0.6875
Residuals  4 4.8265 1.20662               

Notice that Df=4 for the residuals and the residual variance is estimated as 1.20662. Now let's use that to compute t-tests:

> summary(fit)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.7275     1.0985   0.662    0.544
VariableB    -0.9125     1.5535  -0.587    0.588
VariableC    -1.4630     1.2684  -1.153    0.313
VariableD    -1.4415     1.2684  -1.136    0.319

Residual standard error: 1.098 on 4 degrees of freedom
Multiple R-squared:  0.283,     Adjusted R-squared:  -0.2547
F-statistic: 0.5263 on 3 and 4 DF,  p-value: 0.6875

This is all quite standard. No need for any warning. limma does the same base computations as lm() but adds in Bayesian moderation.

On the other hand, if you really did have no replication, then limma would tell you:

> library(limma)
> matrix_in <- matrix(rnorm(400),100,4)
> Variable <- factor(c("A","B","C","D"))
> design <- model.matrix(~Variable)
> fit <- lmFit(matrix_in, design)
> fit <- eBayes(fit)
Error in ebayes(fit = fit, proportion = proportion, stdev.coef.lim = stdev.coef.lim,  :
No residual degrees of freedom in linear model fits


Note that the same general principles apply to the edgeR and DESeq2 packages as well, and neither of them would have any difficulties with your experimental design either. Section 2.11 of the edgeR User's Guide refers to designs that have no replication at all, not to an experiment such as yours. The same goes for DESeq2's single sample method -- it is only for experiments with no replicates at all.