Question

limma with missing values

0

Entering edit mode

Hans-Ulrich Klein ▴ 330

@hans-ulrich-klein-1945

Last seen 7 months ago

United States

Dear all,

I am struggling with using limma in the case of many missing values. Here is a small example:

df = data.frame(
  g1=c(5,5,6,6,NA,NA),
  g2=c(5,5,NA,NA,6,6),
  g3=c(NA,NA,5,5,6,6),
  treat=c("T1","T1","T2","T2","T3","T3"),
  class=c("A","B","A","B","A","B"))

  conts = list(treat="contr.sum", class="contr.treatment")

  X = t(df[,1:3])
  D = model.matrix( ~ class + treat, data=df, contrasts=conts)
  Fit = lmFit(X, D)

These are the results:

 >   Fit$coefficients
   (Intercept)       classB treatT2 treatT3
g1           5 2.220446e-16       1      NA
g2           5 2.220446e-16      NA       1
g3           6 3.330669e-16      -1      NA

Suppose I am interessted in the comparison of T2-T1. The column "treatT2" of "Fit$coefficients" looks fine for gene 1 and gene 2. In the case of gene 3 the baseline level T1 can not be estimated and the coefficient treatT2 seems to be the comparison of T2-T3. I find this confusing when computing many thousand models and applying toptable(Fit, coef="treatT2") afterwards.

I changed the design matrix by setting conts = list(treat="contr.sum", class="contr.sum") Now, I get the results:

 >   Fit$coefficients
   (Intercept)        class1 treat1 treat2
g1         6.0 -1.665335e-16   -1.0     NA
g2         5.5 -1.665335e-16   -0.5     NA
g3         5.0 -1.110223e-16   -1.0     NA

However, I expected:

    Intercept    treat1   treat2
g1   5.5              -0.5     0.5
g2   5.5              -0.5     NA
g3   5.5              NA      -0.5

Is it appropriate to use limma in such scenarios? Did I something wrong or misinterpreted the results? I think, replacing the NAs by mean intensities would be a solution in this case.

Regards,
Hans-Ulrich

> sessionInfo()
R version 2.6.2 (2008-02-08)
x86_64-pc-linux-gnu

locale:
C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] limma_2.12.0

loaded via a namespace (and not attached):
[1] rcompgen_0.1-17

limma • 5.4k views

ADD COMMENT • link updated 16.1 years ago by Gordon Smyth 50k • written 16.1 years ago by Hans-Ulrich Klein ▴ 330

Gordon Smyth · Answer 1 · 2008-03-08

1

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 56 minutes ago

WEHI, Melbourne, Australia

Dear Hans-Ulrich,

The output that you give does not match your code. The correct output is given below.

limma treats missing values in exactly the same way as other linear model function in R, such as lm(), glm() etc. For each gene, the rows with missing values are removed both from the data and design matrix, then non-estimable coefficients are removed (given NA) from the end of the design matrix. See the documentation for those functions for more information.

The ideal solution is not to introduce missing values into your data in the first place. In my experience, missing values are almost always avoidable. I have never seen a situation where it was necessary or desirable to introduce a large proportion of missing values.

Best wishes
Gordon

> library(limma)
> df = data.frame(
+   g1=c(5,5,6,6,NA,NA),
+   g2=c(5,5,NA,NA,6,6),
+   g3=c(NA,NA,5,5,6,6),
+   treat=c("T1","T1","T2","T2","T3","T3"),
+   class=c("A","B","A","B","A","B"))
> conts = list(treat="contr.sum", class="contr.treatment")
> X = t(df[,1:3])
> D = model.matrix( ~ class + treat, data=df, contrasts=conts)
> Fit = lmFit(X, D)
> options(digits=3)
> Fit$coef
    (Intercept)   classB treat1 treat2
g1         6.0 3.15e-16   -1.0     NA
g2         5.5 3.15e-16   -0.5     NA
g3         5.0 2.31e-16   -1.0     NA
> conts = list(treat="contr.sum", class="contr.sum")
> D = model.matrix( ~ class + treat, data=df, contrasts=conts)
> Fit = lmFit(X, D)
> Fit$coef
    (Intercept)    class1 treat1 treat2
g1         6.0 -1.80e-16   -1.0     NA
g2         5.5 -1.80e-16   -0.5     NA
g3         5.0 -1.39e-16   -1.0     NA
> sessionInfo()
R version 2.6.2 (2008-02-08)
i386-pc-mingw32

locale:
LC_COLLATE=English_Australia.1252;LC_CTYPE=English_Australia.1252;LC_M
ONETARY=English_Australia.1252;LC_NUMERIC=C;LC_TIME=English_Australia.
1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] limma_2.12.0

ADD COMMENT • link 16.1 years ago • updated 4.2 years ago Gordon Smyth 50k

0

Entering edit mode

Dear Gordon,

thank you very much for your reply!

The output that you give does not match your code.[...]

I am sorry for that. I wrote conts = list(treat="contr.sum", class="contr.treatment") in my posting but actually used conts = list(treat="contr.treatment", class="contr.treatment").

limma treats missing values in exactly the same way as other linear model function in R, such as lm(), glm() etc. For each gene, the rows with missing values are removed both from the data and design matrix, then non-estimable coefficients are removed (given NA) from the end of the design matrix. See the documentation for those functions for more information.

Yes. This behaviour makes sense for lm(), where only one model is fitted and the user knows that a coefficient can not be estimated due to NAs.

I found it confusing how limma stores and returns the results. Consider an even shorter version of the example in my previous mail (attached at the end of this mail). The design matrix is modeled such that the coefficient "treatT2" is the difference between T2 and T1 (T1 is the baseline). The column "treatT2" of Fit$coefficients stores the estimated difference T2-T1 for gene 1 but T2-T3 for gene 3. Calling topTable() with coef="treatT2" to compare T2 and T1 is misleading.

The problem is that often a user will not notice when limma drops a column of the design matrix for a few of many thousand genes. Maybe a warning message should be printed?

The ideal solution is not to introduce missing values into your data in the first place. In my experimence, missing values are almost always avoidable. I have never seen a situation where it was necessary or desirable to introduce a large proportion of missing values.

About 10%-15% of the spots are flagged by the image analysis software (GenePix) mainly due to saturation. Perhaps the scanner was poorly adjusted. Probably it is a good idea to include them in the analysis with low weights?

Best wishes, Hans-Ulrich

Here is the example:

df = data.frame(g1=c(5,5,6,6,NA,NA),
                 g2=c(5,5,NA,NA,6,6),
                 g3=c(NA,NA,5,5,6,6),
                 treat=c("T1","T1","T2","T2","T3","T3"))
conts = list(treat="contr.treatment")
X = t(df[,1:3])
D = model.matrix( ~ treat, data=df, contrasts=conts)
Fit = lmFit(X, D)
Fit$coefficients
    (Intercept) treatT2 treatT3
g1           5       1      NA
g2           5      NA       1
g3           6      -1      NA

ADD REPLY • link updated 16.1 years ago by Gordon Smyth 50k • written 16.1 years ago by Hans-Ulrich Klein ▴ 330

0

Entering edit mode

Yes. This behaviour makes sense for lm(), where only one model is fitted and the user knows that a coefficient can not be estimated due to NAs.

I find it hard to accept that a behaviour can be sensible for lm() but not lmFit(), since they're both fitting linear models. On the other hand, the behaviour may not be sensible for lm() either. See below.

I found it confusing how limma stores and returns the results. Consider an even shorter version of the example in my previous mail (attached at the end of this mail). The design matrix is modeled such that the coefficient "treatT2" is the difference between T2 and T1 (T1 is the baseline). The column "treatT2" of Fit$coefficients stores the estimated difference T2-T1 for gene 1 but T2-T3 for gene 3. Calling topTable() with coef="treatT2" to compare T2 and T1 is misleading.

I understand your point. However it is hard to think of another strategy for accommodating missing values in lmFit() that would be universally applicable and interprettable. Can you suggest one?

Note that lm() has the same problem. Using lm() you could fit the model for gene3 either by

> lm(X[3,]~0+D)$coef
D(Intercept)     DtreatT2     DtreatT3
           6           -1           NA

or by

> lm(X[3,]~df$treat)$coef
(Intercept)  df$treatT3
           5           1

and both are wrong, in the sense that the T2 and T3 coefficients cannot be interpretted as the T2-T1 and T3-T1 contrasts.

Note that you can use limma as it is now to get exactly the correct answer in all cases. You could proceed by

design <- model.matrix(~0+df$treat)
fit <- lmFit(X,design)

Now if you want to compare T2 to T1, you could remove T3 and compute the contrast by

> fit2 <- contrasts.fit(fit[,-3],c(-1,1))
> fit2$coef
   [,1]
g1    1
g2   NA
g3   NA

which is exactly correct. Similarly to compare T3 to T1

> fit3 <- contrasts.fit(fit[,-2],c(-1,1))
> fit3$coef
    [,1]
g1   NA
g2    1
g3   NA

The problem is that often a user will not notice when limma drops a column of the design matrix for a few of many thousand genes. Maybe a warning message should be printed?

Maybe. On one hand, I'm reluctant to introduce a warning a message in a situation in which lm() does not. One the other hand, I agree with you that linear model formulae and missing values don't mix when one of the coefficients becomes inestimable. The only way to get the correct answer is to use the group-mean parametrisation and form contrasts explicitly.

About 10%-15% of the spots are flagged by the image analysis software (GenePix) mainly due to saturation. Perhaps the scanner was poorly adjusted. Probably it is a good idea to include them in the analysis with low weights?

Yes, definitely, it would be far better to include them, whether or not you use low weights, in fact it would be wrong not to do so. Marking these spots as NA is incorrect, because linear model functions such as lmFit() have to assume that they are missing-at-random, which they are not. They are actually the largest intensities in the dataset.

Similar remarks apply to other spots flagged by GenePix because they are faint. Again it would be incorrect to convert them to NA or to give them zero weight. They do contain information.

Best wishes
Gordon

ADD REPLY • link 16.1 years ago • updated 4.2 years ago Gordon Smyth 50k

0

Entering edit mode

The problem is that often a user will not notice when limma drops a column of the design matrix for a few of many thousand genes. Maybe a warning message should be printed?

I have now added a warning message to lmFit() in this situation.

Best wishes
Gordon

ADD REPLY • link 16.1 years ago • updated 4.2 years ago Gordon Smyth 50k