Question

limma overfitting model due to intercept and yet exclusion of intercept= uninterpretable results

0

Entering edit mode

christopher.clarskon15 ▴ 20

@christopherclarskon15-9965

Last seen 8.1 years ago

I have microarray data for which the columns represent the disease condition status of each individual.

I plan to fit a linear model and then make a contrast amongst the different columns using limma. My question is: how do I make the most informative comparison when contrasting all of the columns against all of the other disease stasuses...

Other<-ifelse(Information_file$Diagnostic.category=="Other diagnoses", 1,0)
TB<-ifelse(Information_file$Diagnostic.category=="TB", 1,0)

So as you can see above I am trying to test for a significant difference between TB and all other diagnoses and I need to eliminate any confounding to see if there are any differentially expressed genes that are genuinely due to TB and not just one of the other diagnoses....

The first attempt that I made at this was using the following design matrix:

   > head(design)
       TB Intercept Other Sex Age
  [1,]  1         1     0   1 155
  [2,]  1         1     0   2  16
  [3,]  1         1     0   1  22
  [4,]  1         1     0   1 114
  [5,]  1         1     0   2  56
  [6,]  1         1     0   2  47

And the following code produced a nice plot:

  fit<-lmFit(E.ncRNA1, design)
  contrast.matrix <- makeContrasts(TB, levels=design)
  fit <- contrasts.fit(fit, contrast.matrix)
  fit <- eBayes(fit)
  volcanoplot(fit, highlight=10, main="TB vs Everything")

However my supervisor says that this is not statistically robust as:

including an intercept over-specifies the model
he'd like to know if there's a non-zero difference between the TB and Other columns i.e. specifying 1 for the TB covariate and -1 for the OD covariate

Hence, lets say my new design matrix is designed as follows:

   > head(design)
       TB Other Sex Age
  [1,]  1     0   1 155
  [2,]  1     0   2  16
  [3,]  1     0   1  22
  [4,]  1     0   1 114
  [5,]  0    -1   2  56
  [6,]  0    -1   2  47

And yet the exclusion of an Intercept gives rise to plots that look like this:

So overall my questions are as follows how do I make the most valuable comparison to test for a non-zero difference between TB and other diseases without including an intercept?

limma microarray linear model statistical modeling model.matrix • 1.5k views

ADD COMMENT • link 7.9 years ago christopher.clarskon15 ▴ 20

score 1 · Answer 1 · 2016-05-20

You don't show how you created the design matrix, so it's not clear what the intercept is. And I am not sure exactly what your supervisor means by 'over-specified'. My interpretation of that term is either a.) the design matrix is not of full rank, or b.) you have extra unnecessary parameters in your model.

If you gave us all the output, then a.) is not true, because limma will tell you if you have non-estimable coefficients, and b.) may or may not be true, but in microarray analyses this is often an unavoidable aspect, since you are fitting thousands of models, some of which may well benefit by having a particular coefficient, and some of which clearly do not.

You don't show enough of the original design matrix, nor the contrast matrix at all, so given the information in hand it isn't possible (for me at least) to say much more. Instead of the plots and just snippets of code, you should give us the least amount of code that will clearly show exactly how you generated the design and contrast matrix.

score 0 · Answer 2 · 2016-05-20

Hi James,

the design matrix was created as follows:

design <- cbind(1, as.numeric(as.factor(clinical1$Sex)), clinical1$Age..mths. ) colnames(design)=c("Intercept", "Sex", "Age") Other<-ifelse(clinical1$Diagnostic.category=="Other diagnoses", 1,0) TB<-ifelse(clinical1$Diagnostic.category=="Culture confirmed TB" | clinical1$Diagnostic.category=="Culture negative TB", 0, -1) design<-cbind(TB, Other, design) colnames(design)=c("TB", "Other", "Intercept", "Sex", "Age")

The contrast matrix looks like this:

> head(design) Intercept TB Other Sex Age [1,] 1 1 -1 1 155 [2,] 1 1 0 2 16 [3,] 1 1 0 1 22 [4,] 1 1 0 1 114 [5,] 1 1 -1 2 56 [6,] 1 1 0 2 47
> fit2<-lmFit(E.ncRNA1, design) > contrast.matrix <- makeContrasts(TB, levels=design) > contrast.matrix Contrasts Levels TB Intercept 0 TB 1 Other 0 Sex 0 Age 0

The Intercept column just provides a null model as it is all 1's I was told to do it this way but I am new to linear algebra- I am checking up the new definition of rank in matrices on Wikipedia... if you know of any good tutorials in terms of using design matrices could you send them my way?? Also there are some individuals who are not either of TB or Other..... thanks