Question

Why does limma contrast.fit produces NA's for all contrast although missing data only in one condition?

0

Entering edit mode

wewolski ▴ 10

@wewolski-8499

Last seen 2.4 years ago

Zurich

I will try to describe briefly what I am doing:

my design matrix

designMatrix <- model.matrix( ~ 0 + Condition + Plant, data=grp2$annotation_)

> unique(grp2$annotation_$Condition)
[1] "PC_16h_sys" "PC_16h_test" "PC_38h_sys" "PC_38h_test" "PC_96h_sys" "PC_96h_test"

Then I define the contrasts I want to compute:

  contrasts = c(
    "PC_38h_sys - PC_16h_sys" ,
    "PC_96h_sys - PC_16h_sys" ,
    "PC_38h_test - PC_16h_test" ,
    "PC_96h_test - PC_16h_test" )

Then I generate a fit with lmFit

       fit <- lmFit(intmat , designMatrix)

since in the variable intmat in row 38 are missing values for condition PC_96h_test

> fit$coefficients[38,]
 PC_16h_sys PC_16h_test  PC_38h_sys PC_38h_test  PC_96h_sys PC_96h_test     PlantB2 
 -4.0764436  -3.5428944  -3.4582324  -2.6018461  -3.4004178          NA  -0.2996508

next, I do compute contrasts and all the contrasts are NAs. I would have expected that only the contrast involving PC_96h_test are NA but not all contrasts.

cont <- limma::makeContrasts(contrasts = contrasts, levels =  levels)
lmfit.cont <- contrasts.fit(fit, cont)
> lmfit.cont[38,]
An object of class "MArrayLM"
$coefficients
              Contrasts
               PC_38h_sys - PC_16h_sys PC_96h_sys - PC_16h_sys PC_38h_test - PC_16h_test PC_96h_test - PC_16h_test
  CCnCP_000456                      NA                      NA                        NA                        NA

I very much like the contrasts interface and I much prefer it to working with relevel. Is there a way to better handle NA's with contrasts?

Looking at the code for contrasts.fit I see that there is the matrix multiplication:

fit$coefficients[38,] %*% cont

Which gives NA for all contrasts if the coefficient row contains just a single NA.

Doing something on the lines :

fit$coefficients[38,][!is.na(fit$coefficients[38,])] %*% cont[!is.na(fit$coefficients[38,]),]
      Contrasts
       PC_38h_sys - PC_16h_sys PC_96h_sys - PC_16h_sys PC_38h_test - PC_16h_test PC_96h_test - PC_16h_test
  [1,]               0.6182112               0.6760258                 0.9410483                  3.542894

could give me the coefficients I am looking for. But of course, there is much more going on in contrasts.fit

cov.coefficients, stdev.unscaled,

Can anyone more informed, tell me if contrasts.fit could be rewritten to better handle NAs for instance by going through the coefficients ,stdev.unscaled row by row dropping NAs etc. And just for completeness

Thank you

Witold

limma contrast missing data lmfit makecontrasts • 2.4k views

ADD COMMENT • link updated 6.6 years ago by Gordon Smyth 50k • written 6.7 years ago by wewolski ▴ 10

0

Entering edit mode

Thanks, Gordon,

I have now a solution based on the post:

contrasts in limma and missing values

but it is a hack, workaround, and features usually do not need workarounds.

Still, if you think that contrast.fit not handling NA is a feature why not just add contrast.fit.NA to limma which is able to handle NA's if this is what is asked for?

ADD REPLY • link updated 6.7 years ago by Gordon Smyth 50k • written 6.7 years ago by wewolski ▴ 10

0

Entering edit mode

Well, I don't consider one request in the past 9 years to be an overwhelming clamour of user requests. Believe me, one can't implement everything than everyone suggests and still write quality software.

For your part, if you want to comment on my answer, please post a comment (by clicking on "Add Comment") rather than answering your own question (by clicking on "Add your answer"). I'm a site moderator, so I've taken the liberty of moving your answer to be a comment.

ADD REPLY • link 6.7 years ago Gordon Smyth 50k

Gordon Smyth · Answer 1 · 2017-08-23

I've already answered this question long ago, see dealing with NAs in limma or contrasts in limma and missing values or Limma: NA handling in contrasts.fit . I'll repeat here at a bit greater length the same answer that I gave 10 years ago.

NA coefficients in linear models are, in general, quite problematic. For your linear model, the leading coefficients correspond to average log-expression values for different conditions, and these retain a meaning even though one of them is NA. For many other linear models, however, the interpretation of all the estimated coefficients can be seriously compromised, even the non-NA coefficients, when one or more of them are NA.

It is impossible for limma to reliably distinguish when the estimated coefficients are meaningful from when they are not, hence limma takes the conservative approach of making contrasts NA when any of the coefficients are NA.

Now you're the one who created the design matrix, and you may have the expertise to judge that the estimated coefficients remain meaningful even in the presence of an NA coefficient. In that case, you can (easily and elegantly) remove the missing coefficient by subsetting the fit object. You may consider this to be unnecessary trouble, but I prefer to ask users to do this rather than to risk returning rubbish results to non-expert users.

In my opinion, it is very seldom necessary to have NA values in expression data. Often people introduce NA values by the mistaken ideas that analyses can't handle below-threshold measurements or can't cope with lower quality values. I find that it is often possible to reprocess data so that NA values don't exist in the first place, and this is really the best solution.

In your case, why are the expression values for PC_96h_test all NA? Is it possible that the expression values have simply failed to meet some detection threshold for this condition? If that is so, then it would make sense to input low values (even somewhat arbitrary values) rather than NA because otherwise the information that the expression is low in this condition is being lost.