limma overfitting model due to intercept and yet exclusion of intercept= uninterpretable results
2
0
Entering edit mode
@christopherclarskon15-9965
Last seen 5.5 years ago

I have microarray data for which the columns represent the disease condition status of each individual.

I plan to fit a linear model and then make a contrast amongst the different columns using limma. My question is: how do I make the most informative comparison when contrasting all of the columns against all of the other disease stasuses...

Other<-ifelse(Information_file$Diagnostic.category=="Other diagnoses", 1,0) TB<-ifelse(Information_file$Diagnostic.category=="TB", 1,0)

So as you can see above I am trying to test for a significant difference between TB and all other diagnoses and I need to eliminate any confounding to see if there are any differentially expressed genes that are genuinely due to TB and not just one of the other diagnoses....

The first attempt that I made at this was using the following design matrix:

   > head(design)
TB Intercept Other Sex Age
[1,]  1         1     0   1 155
[2,]  1         1     0   2  16
[3,]  1         1     0   1  22
[4,]  1         1     0   1 114
[5,]  1         1     0   2  56
[6,]  1         1     0   2  47

And the following code produced a nice plot:

  fit<-lmFit(E.ncRNA1, design)
contrast.matrix <- makeContrasts(TB, levels=design)
fit <- contrasts.fit(fit, contrast.matrix)
fit <- eBayes(fit)
volcanoplot(fit, highlight=10, main="TB vs Everything")

However my supervisor says that this is not statistically robust as:

1. including an intercept over-specifies the model
2. he'd like to know if there's a non-zero difference between the TB and Other columns i.e. specifying 1 for the TB covariate and -1 for the OD covariate

Hence, lets say my new design matrix is designed as follows:

   > head(design)
TB Other Sex Age
[1,]  1     0   1 155
[2,]  1     0   2  16
[3,]  1     0   1  22
[4,]  1     0   1 114
[5,]  0    -1   2  56
[6,]  0    -1   2  47

And yet the exclusion of an Intercept gives rise to plots that look like this:

So overall my questions are as follows how do I make the most valuable comparison to test for a non-zero difference between TB and other diseases without including an intercept?

1
Entering edit mode
@james-w-macdonald-5106
Last seen 17 hours ago
United States

You don't show how you created the design matrix, so it's not clear what the intercept is. And I am not sure exactly what your supervisor means by 'over-specified'. My interpretation of that term is either a.) the design matrix is not of full rank, or b.) you have extra unnecessary parameters in your model.

If you gave us all the output, then a.) is not true, because limma will tell you if you have non-estimable coefficients, and b.) may or may not be true, but in microarray analyses this is often an unavoidable aspect, since you are fitting thousands of models, some of which may well benefit by having a particular coefficient, and some of which clearly do not.

You don't show enough of the original design matrix, nor the contrast matrix at all, so given the information in hand it isn't possible (for me at least) to say much more. Instead of the plots and just snippets of code, you should give us the least amount of code that will clearly show exactly how you generated the design and contrast matrix.

0
Entering edit mode
@christopherclarskon15-9965
Last seen 5.5 years ago

Hi James,

the design matrix was created as follows:

design <- cbind(1, as.numeric(as.factor(clinical1$Sex)), clinical1$Age..mths. ) colnames(design)=c("Intercept", "Sex", "Age") Other<-ifelse(clinical1$Diagnostic.category=="Other diagnoses", 1,0) TB<-ifelse(clinical1$Diagnostic.category=="Culture confirmed TB" | clinical1$Diagnostic.category=="Culture negative TB", 0, -1) design<-cbind(TB, Other, design) colnames(design)=c("TB", "Other", "Intercept", "Sex", "Age") The contrast matrix looks like this: > head(design) Intercept TB Other Sex Age [1,] 1 1 -1 1 155 [2,] 1 1 0 2 16 [3,] 1 1 0 1 22 [4,] 1 1 0 1 114 [5,] 1 1 -1 2 56 [6,] 1 1 0 2 47 > fit2<-lmFit(E.ncRNA1, design) > contrast.matrix <- makeContrasts(TB, levels=design) > contrast.matrix Contrasts Levels TB Intercept 0 TB 1 Other 0 Sex 0 Age 0 The Intercept column just provides a null model as it is all 1's I was told to do it this way but I am new to linear algebra- I am checking up the new definition of rank in matrices on Wikipedia... if you know of any good tutorials in terms of using design matrices could you send them my way?? Also there are some individuals who are not either of TB or Other..... thanks ADD COMMENT 0 Entering edit mode If I'm getting you right, and you are just trying to test for TB effects, and control for age and sex (assuming they are not confounded with each other), then assuming that: 1. The only two values in clinical1$Diagnostic.category are "Other diagnosis" and "TB"; and
2. You can convert the 1/0 encoding in Sex to M/F like I will below; and
3. Maybe bi (tri (whatever))-chotomize age? Define the breaks in age.breaks (like 'young' and 'old', or 'young', 'medium', 'old', etc.)

I'd construct your design like so:

clin <- transform(clinical1,
status=ifelse(Diagnostic.category == 'TB', 'TB', 'other'),
sex=ifelse(Sex == 1, 'M', 'F'),
age=cut(Age..mths., age.breaks))
des <- model.matrix(~ status + sex + age, clin)

You can then test against the second coefficient (probably names something like statusTB, right?)

fit <- eBayes(lmFit(E.ncRNA1, des))
res <- topTable(fit, coef='statusTB', n=Inf)

If you want to further split your "other" diagnostic category into more specific diseases (assuming you have enough degrees of freedom and your ages and sex span the disease states) you can do similar moves.