Why does limma contrast.fit produces NA's for all contrast although missing data only in one condition?
Entering edit mode
wewolski ▴ 10
Last seen 8 weeks ago

I will try to describe briefly what I am doing:

my design matrix

designMatrix <- model.matrix( ~ 0 + Condition + Plant, data=grp2$annotation_)

> unique(grp2$annotation_$Condition)
[1] "PC_16h_sys"  "PC_16h_test" "PC_38h_sys"  "PC_38h_test" "PC_96h_sys"  "PC_96h_test"

Then I define the contrasts I want to compute: 

  contrasts = c(
    "PC_38h_sys - PC_16h_sys" ,
    "PC_96h_sys - PC_16h_sys" ,
    "PC_38h_test - PC_16h_test" ,
    "PC_96h_test - PC_16h_test" )

Then I generate a fit with lmFit

       fit <- lmFit(intmat , designMatrix)

since in the variable intmat in row 38 are missing values for condition PC_96h_test  

> fit$coefficients[38,]
 PC_16h_sys PC_16h_test  PC_38h_sys PC_38h_test  PC_96h_sys PC_96h_test     PlantB2 
 -4.0764436  -3.5428944  -3.4582324  -2.6018461  -3.4004178          NA  -0.2996508 

next, I do compute contrasts and all the contrasts are NAs. I would have expected that only the contrast involving PC_96h_test are NA but not all contrasts.


cont <- limma::makeContrasts(contrasts = contrasts, levels =  levels)
lmfit.cont <- contrasts.fit(fit, cont)
> lmfit.cont[38,]
An object of class "MArrayLM"
               PC_38h_sys - PC_16h_sys PC_96h_sys - PC_16h_sys PC_38h_test - PC_16h_test PC_96h_test - PC_16h_test
  CCnCP_000456                      NA                      NA                        NA                        NA


I very much like the contrasts interface and I much prefer it to working with relevel. Is there a way to better handle NA's with contrasts?

Looking at the code for contrasts.fit I see that there is the matrix multiplication:
fit$coefficients[38,] %*% cont

Which gives NA for all contrasts if the coefficient row contains just a single NA.

Doing something on the lines :

fit$coefficients[38,][!is.na(fit$coefficients[38,])] %*% cont[!is.na(fit$coefficients[38,]),]
       PC_38h_sys - PC_16h_sys PC_96h_sys - PC_16h_sys PC_38h_test - PC_16h_test PC_96h_test - PC_16h_test
  [1,]               0.6182112               0.6760258                 0.9410483                  3.542894

could give me the coefficients I am looking for. But of course, there is much more going on in contrasts.fit

cov.coefficients, stdev.unscaled, 

Can anyone more informed, tell me if contrasts.fit could be rewritten to better handle NAs for instance by going through the coefficients ,stdev.unscaled  row by row dropping NAs etc. And just for completeness 

Thank you


limma contrast missing data lmfit makecontrasts • 1.4k views
Entering edit mode

Thanks, Gordon,

I have now a solution based on the post:

 contrasts in limma and missing values

but it is a hack, workaround, and features usually do not need workarounds. 

Still, if you think that contrast.fit not handling NA is a feature why not just add contrast.fit.NA to limma which is able to handle NA's if this is what is asked for?

Entering edit mode

Well, I don't consider one request in the past 9 years to be an overwhelming clamour of user requests. Believe me, one can't implement everything than everyone suggests and still write quality software.

For your part, if you want to comment on my answer, please post a comment (by clicking on "Add Comment") rather than answering your own question (by clicking on "Add your answer"). I'm a site moderator, so I've taken the liberty of moving your answer to be a comment.

Entering edit mode
Last seen 2 hours ago
WEHI, Melbourne, Australia

I've already answered this question long ago, see dealing with NAs in limma or contrasts in limma and missing values or Limma: NA handling in contrasts.fit . I'll repeat here at a bit greater length the same answer that I gave 10 years ago.

NA coefficients in linear models are, in general, quite problematic. For your linear model, the leading coefficients correspond to average log-expression values for different conditions, and these retain a meaning even though one of them is NA. For many other linear models, however, the interpretation of all the estimated coefficients can be seriously compromised, even the non-NA coefficients, when one or more of them are NA.

It is impossible for limma to reliably distinguish when the estimated coefficients are meaningful from when they are not, hence limma takes the conservative approach of making contrasts NA when any of the coefficients are NA.

Now you're the one who created the design matrix, and you may have the expertise to judge that the estimated coefficients remain meaningful even in the presence of an NA coefficient. In that case, you can (easily and elegantly) remove the missing coefficient by subsetting the fit object. You may consider this to be unnecessary trouble, but I prefer to ask users to do this rather than to risk returning rubbish results to non-expert users.

In my opinion, it is very seldom necessary to have NA values in expression data. Often people introduce NA values by the mistaken ideas that analyses can't handle below-threshold measurements or can't cope with lower quality values. I find that it is often possible to reprocess data so that NA values don't exist in the first place, and this is really the best solution.

In your case, why are the expression values for PC_96h_test all NA? Is it possible that the expression values have simply failed to meet some detection threshold for this condition? If that is so, then it would make sense to input low values (even somewhat arbitrary values) rather than NA because otherwise the information that the expression is low in this condition is being lost.

Entering edit mode

"Is it possible that the expression values have simply failed to meet some detection threshold for this condition?"

No, it is not. Let me assure you, I did try to understand and model where the missingness is coming from.  If I used the imputation you suggest I would introduce a heavy bias. Bias is generally the danger of using imputation. Therefore, I am surprised, that someone with your standing would suggest using imputation, without at least making a lot of explicit and implicit disclaimers and warnings.

And even if there was a nice way to model the missingness and impute the data, I still would like to run the analysis twice, once with NA's and once with the imputed data to see what differences the imputation produces, especially for those cases where I have in one condition missing values.

So I do not take "use imputation" as an answer, to me asking: 

"Can anyone more informed, tell me if contrasts.fit could be rewritten to better handle NAs for instance by going through the coefficients ,stdev.unscaled  row by row dropping NAs etc."


Entering edit mode

If you used imputation in the way and in the circumstance that I suggested it, then it would reduce bias. That's why I suggested it.

Whether or not imputation is relevant to your situation, I have already explained to you why contrasts.fit isn't written in the way that you demand.

You can easily do the analysis you want to do, despite the NAs, as I explained in my answers from 8 years ago.


Login before adding your answer.

Traffic: 465 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6