I was surprised that limma sometimes returns very small p-values in comparisons where one group is completely missing. And I am sorry if this is already a well-known behavior, I tried to search for descriptions of this, but couldn't find anything that matched my exact observation.
library(limma)
#> Warning: package 'limma' was built under R version 4.2.2
limma_test <- function(y, dat, formula, contrast){
fit <- lmFit(y, design = model.matrix(formula, data = dat))
fit <- contrasts.fit(fit, contrasts = contrast)
fit <- eBayes(fit)
topTable(fit)
}
dat <- data.frame(cond = rep(c("a", "b", "c"), each = 3))
y_compl <- matrix(c(2, 4, 8, 17, 19, 15, 2, 5, 3), nrow = 1)
limma_test(y_compl, dat, ~ cond, c(0, -1, 0))
#> logFC AveExpr t P.Value adj.P.Val B
#> 1 -12.33333 8.333333 -6.609954 0.0005770005 0.0005770005 -0.9740698
limma_test(y_compl, dat, ~ cond - 1, c(1, -1, 0))
#> logFC AveExpr t P.Value adj.P.Val B
#> 1 -12.33333 8.333333 -6.609954 0.0005770005 0.0005770005 -0.9740698
y_mis <- matrix(c(NA, NA, NA, 17, 19, 15, 2, 5, 3), nrow = 1)
limma_test(y_mis, dat, ~ cond, c(0, -1, 0))
#> Warning: Partial NA coefficients for 1 probe(s)
#> logFC AveExpr t P.Value adj.P.Val B
#> 1 -13.66667 10.16667 -9.406045 0.0007120111 0.0007120111 -1.005365
limma_test(y_mis, dat, ~ cond - 1, c(1, -1, 0))
#> Warning: Partial NA coefficients for 1 probe(s)
#> Warning in .ebayes(fit = fit, proportion = proportion, stdev.coef.lim =
#> stdev.coef.lim, : Estimation of var.prior failed - set to default value
#> logFC AveExpr t P.Value adj.P.Val B
#> 1 NA 10.16667 NA NA NA NA
limma_test(y_mis, dat, ~ cond - 1, c(0, -1, 1))
#> Warning: Partial NA coefficients for 1 probe(s)
#> logFC AveExpr t P.Value adj.P.Val B
#> 1 -13.66667 10.16667 -9.406045 0.0007120111 0.0007120111 -1.005365
(This is using R 4.2.1 and limma 3.54.1)
The first two calls to 'limma_test' show that it usually doesn't matter how the design is specified if the contrast is adapted accordingly. However, if the data contains missing values, in my specific example, limma returns a small p-value that actually corresponds to the comparison b vs. c.
Can someone reproduce this behavior, and is the behavior expected? And lastly, is there a workaround to ensure that, independent of the specification of the design matrix, a contrast between two groups where one is completely missing the p-value is NA
?
I should probably clarify. For any linear model, any observations with
NA
predictors are dropped. So if the A condition is allNA
, all of the data for that condition will be dropped, and the model will be fit to the remaining data, in which case the new baseline will be the 'B' group, and the only comparison is C vs B. This is true if you have anyNA
values for any of the predictors in the model. By default you will do a 'complete cases' analysis, where you only use those data with complete observations for all the predictors.The
limma
package is nice enough to let you know that happened, as compared to say,lm
, which just drops the data withNA
predictors unceremoniously.Hi James, thanks for the answer. I am confused though because the predictors are not
NA
in my example it's only the respones (y
) containNA
's. Or do I misunderstand something?If I understand the code in limma correctly (https://code.bioconductor.org/browse/limma/blob/devel/R/lmfit.R?branch=devel&file=R/lmfit.R#L165), limma manually filters out the missing observations.
I just found a neat function in the
lmerTest
package (https://github.com/cran/lmerTest/blob/15e4eca5b0ebc1b91ec5ede43e8296edff74ebf7/R/estimability.R#L61), which checks if a contrast is estimable. I wonder if such an approach could be integrated intocontrast.fit
.