limma: contrasting two groups of which one is completely missing returns small p-value
2
0
Entering edit mode
@constantin-ahlmann-eltze-20493
Last seen 1 day ago
Germany/Heidelberg/EMBL

I was surprised that limma sometimes returns very small p-values in comparisons where one group is completely missing. And I am sorry if this is already a well-known behavior, I tried to search for descriptions of this, but couldn't find anything that matched my exact observation.

library(limma)
#> Warning: package 'limma' was built under R version 4.2.2

limma_test <- function(y, dat, formula, contrast){
fit <- lmFit(y, design = model.matrix(formula, data = dat))
fit <- contrasts.fit(fit, contrasts = contrast)
fit <- eBayes(fit)
topTable(fit)
}

dat <- data.frame(cond = rep(c("a", "b", "c"), each = 3))
y_compl <- matrix(c(2, 4, 8, 17, 19, 15, 2, 5, 3), nrow = 1)
limma_test(y_compl, dat, ~ cond, c(0, -1, 0))
#>       logFC  AveExpr         t      P.Value    adj.P.Val          B
#> 1 -12.33333 8.333333 -6.609954 0.0005770005 0.0005770005 -0.9740698
limma_test(y_compl, dat, ~ cond - 1, c(1, -1, 0))
#>       logFC  AveExpr         t      P.Value    adj.P.Val          B
#> 1 -12.33333 8.333333 -6.609954 0.0005770005 0.0005770005 -0.9740698

y_mis <- matrix(c(NA, NA, NA, 17, 19, 15, 2, 5, 3), nrow = 1)
limma_test(y_mis, dat, ~ cond, c(0, -1, 0))
#> Warning: Partial NA coefficients for 1 probe(s)
#>       logFC  AveExpr         t      P.Value    adj.P.Val         B
#> 1 -13.66667 10.16667 -9.406045 0.0007120111 0.0007120111 -1.005365
limma_test(y_mis, dat, ~ cond - 1, c(1, -1, 0))
#> Warning: Partial NA coefficients for 1 probe(s)
#> Warning in .ebayes(fit = fit, proportion = proportion, stdev.coef.lim =
#> stdev.coef.lim, : Estimation of var.prior failed - set to default value
#>   logFC  AveExpr  t P.Value adj.P.Val  B
#> 1    NA 10.16667 NA      NA        NA NA
limma_test(y_mis, dat, ~ cond - 1, c(0, -1, 1))
#> Warning: Partial NA coefficients for 1 probe(s)
#>       logFC  AveExpr         t      P.Value    adj.P.Val         B
#> 1 -13.66667 10.16667 -9.406045 0.0007120111 0.0007120111 -1.005365


(This is using R 4.2.1 and limma 3.54.1)

The first two calls to 'limma_test' show that it usually doesn't matter how the design is specified if the contrast is adapted accordingly. However, if the data contains missing values, in my specific example, limma returns a small p-value that actually corresponds to the comparison b vs. c.

Can someone reproduce this behavior, and is the behavior expected? And lastly, is there a workaround to ensure that, independent of the specification of the design matrix, a contrast between two groups where one is completely missing the p-value is NA?

limma • 204 views
1
Entering edit mode
@james-w-macdonald-5106
Last seen 5 hours ago
United States

With any linear model, NA predictors are dropped. As an example

> df <- data.frame(dat, y = t(y_compl))
> df2 <- df
> df2[1:3, 1] <- NA
> summary(lm(y~cond, df))

Call:
lm(formula = y ~ cond, data = df)

Residuals:
Min      1Q  Median      3Q     Max
-2.6667 -1.3333 -0.3333  1.6667  3.3333

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)    4.667      1.319   3.537 0.012263 *
condb         12.333      1.866   6.610 0.000577 ***
condc         -1.333      1.866  -0.715 0.501705
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.285 on 6 degrees of freedom
Multiple R-squared:  0.9158,    Adjusted R-squared:  0.8877
F-statistic: 32.62 on 2 and 6 DF,  p-value: 0.0005976

> summary(lm(y~cond, df2))

Call:
lm(formula = y ~ cond, data = df2)

Residuals:
4          5          6          7          8          9
-3.553e-15  2.000e+00 -2.000e+00 -1.333e+00  1.667e+00 -3.333e-01

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   17.000      1.027  16.547 7.81e-05 ***
condc        -13.667      1.453  -9.406 0.000712 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.78 on 4 degrees of freedom
(3 observations deleted due to missingness)
Multiple R-squared:  0.9567,    Adjusted R-squared:  0.9459
F-statistic: 88.47 on 1 and 4 DF,  p-value: 0.000712


The workaround is to not have NA predictor values, or to realize that they are going to be dropped and planning accordingly.

0
Entering edit mode

I should probably clarify. For any linear model, any observations with NA predictors are dropped. So if the A condition is all NA, all of the data for that condition will be dropped, and the model will be fit to the remaining data, in which case the new baseline will be the 'B' group, and the only comparison is C vs B. This is true if you have any NA values for any of the predictors in the model. By default you will do a 'complete cases' analysis, where you only use those data with complete observations for all the predictors.

The limma package is nice enough to let you know that happened, as compared to say, lm, which just drops the data with NA predictors unceremoniously.

0
Entering edit mode

Hi James, thanks for the answer. I am confused though because the predictors are not NA in my example it's only the respones (y) contain NA's. Or do I misunderstand something?

If I understand the code in limma correctly (https://code.bioconductor.org/browse/limma/blob/devel/R/lmfit.R?branch=devel&file=R/lmfit.R#L165), limma manually filters out the missing observations.

0
Entering edit mode

I just found a neat function in the lmerTest package (https://github.com/cran/lmerTest/blob/15e4eca5b0ebc1b91ec5ede43e8296edff74ebf7/R/estimability.R#L61), which checks if a contrast is estimable. I wonder if such an approach could be integrated into contrast.fit.

1
Entering edit mode
@gordon-smyth
Last seen 4 hours ago
WEHI, Melbourne, Australia

Yes, it is true that the meaning of the limma coefficients will change if you fit a model with an intercept and all the observations in the reference level of the treatment are missing. This is a consequence of subsetting the design matrix to remove missing y-values from the linear model.

Using your example, with the full design matrix:

> cond = rep(c("a", "b", "c"), each = 3)
> design <- model.matrix(~cond)
> design
(Intercept) condb condc
1           1     0     0
2           1     0     0
3           1     0     0
4           1     1     0
5           1     1     0
6           1     1     0
7           1     0     1
8           1     0     1
9           1     0     1


the second coefficient corresponds to b vs a. However if the first three observations are removed:

> design[4:9,]
(Intercept) condb condc
4           1     1     0
5           1     1     0
6           1     1     0
7           1     0     1
8           1     0     1
9           1     0     1


then the design matrix is no longer of full rank. The 3rd coefficient will be removed from the analysis and the second coefficient will compare b to c instead of b to a.

The work around is to use the group mean parametrization

design <- model.matrix(~0+cond)


then this problem will not occur.

I have discussed this issue previously on this forum. The basic rule is that the coefficients cannot be interpreted according to their original meanings if NAs cause the design matrix to become not of full rank. limma issues a warning whenever this occurs. The oneway layout with the group-mean parametrization is an exception for which the coefficients will retain their interpretations.

0
Entering edit mode

Hi Gordon, thanks for taking the time to answer the question. If you allow a follow-up: Is it possible to add some check like lmerTest::is_estimable to fit.constrast? Or alternative do you know if there's an efficient way to check for many features with varying missing values for which features a contrast is estimable?

0
Entering edit mode

It's aready clear which features are affected -- the ones with NA coefficients. It is an issue with coefficients rather than with contrasts.