What to do if covariates in design matrix of limma are confounded
nhaus • 0
@789c70a6
Switzerland

Hello,

I am using limma to determine differential gene expression between healthy and KO mice. In my design matrix, I am including several covariates that I know influence gene expression, but that I am not interested in. Specifically, it looks something like this: 0+disease_status+age+batch+onset.

For disease_status there are only two values (diseased or healthy), and onset describes the site where the first symptoms occurred.

The problem is that for all healthy animals the value of onset is "Undefined", because healthy animals have no site of onset.

This means that the "healthy" level of disease_status is confounded with onset. I think that is why I get the following warning when I run limma:

Coefficients not estimable: ...

Is there a way to adjust the design matrix so that this problem does not occur? If not, how should I handle it?

Any insights are much appreciated!

limma DifferentialExpression • 467 views
@gordon-smyth
WEHI, Melbourne, Australia

You need to combine disease_status and onset into one factor taking values "healthy", "diseased.onset1", "diseased.onset2" etc.
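A minimal sketch of what the combined factor could look like in base R. The sample sheet `targets`, its values, and the onset labels `onset1`/`onset2` are made up for illustration and are not from the original post.

```r
# Toy sample sheet; every value here is hypothetical.
targets <- data.frame(
  disease_status = c("healthy", "healthy", "healthy",
                     "diseased", "diseased", "diseased", "diseased"),
  onset = c("Undefined", "Undefined", "Undefined",
            "onset1", "onset1", "onset2", "onset2"),
  age   = c(10, 12, 11, 10, 13, 12, 11),
  batch = factor(c("A", "B", "A", "A", "B", "A", "B"))
)

# One factor: "healthy" for healthy animals, "diseased.<onset>" otherwise.
group <- ifelse(targets$disease_status == "healthy",
                "healthy",
                paste("diseased", targets$onset, sep = "."))
targets$group <- factor(
  group,
  levels = c("healthy", "diseased.onset1", "diseased.onset2")
)

# The combined factor replaces both disease_status and onset in the design.
design <- model.matrix(~ 0 + group + age + batch, data = targets)
colnames(design) <- sub("^group", "", colnames(design))

# TRUE when the design has full column rank (no confounded columns).
print(qr(design)$rank == ncol(design))
```

Because "Undefined" no longer appears as a separate level, no column of the design duplicates another, so every coefficient can be estimated.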


Just for my understanding, wouldn't that mean that I decrease my power to detect differentially expressed genes, because I now have fewer samples in the different "diseased" groups?

And what is the consequence if I just continue using the naïve model where I do not combine disease_status and onset? Despite the warning, I get many differentially expressed genes that make biological sense.


"wouldn't that mean that I decrease my power"

No, it doesn't. If you want to compare healthy to diseased overall, then you form a contrast like (diseased.onset1 + diseased.onset2)/2 - healthy, which uses all the samples. Combining the two factors into one has no disadvantages at all compared to the model you are fitting now.
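As a rough sketch, such a contrast could be set up with limma's makeContrasts() as below. The expression matrix is simulated noise, the group sizes are invented, and age and batch are left out for brevity, so the numbers themselves mean nothing.

```r
library(limma)

set.seed(1)

# Hypothetical combined factor with two onset sites.
group <- factor(
  rep(c("healthy", "diseased.onset1", "diseased.onset2"), times = c(4, 3, 3)),
  levels = c("healthy", "diseased.onset1", "diseased.onset2")
)
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

# Simulated log-expression values: 50 genes x 10 samples (illustration only).
expr <- matrix(rnorm(50 * length(group)), nrow = 50,
               dimnames = list(paste0("gene", 1:50), NULL))

# Average of the diseased groups versus healthy.
contr <- makeContrasts(
  DiseasedVsHealthy = (diseased.onset1 + diseased.onset2) / 2 - healthy,
  levels = design
)

fit  <- lmFit(expr, design)
fit2 <- eBayes(contrasts.fit(fit, contr))
print(topTable(fit2, coef = "DiseasedVsHealthy", number = 5))
```

The residual variance is estimated from all samples in the fit, which is why splitting the diseased animals into onset groups does not cost power for this overall comparison.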

"what is the consequence if I just continue using the naïve model"

The naive model may be ok or wrong depending on which coefficients limma has removed automatically. The removed coefficients are the ones listed as non-estimable (which you didn't show in your question).

In any case, the naive model means that when you test for disease_status you are actually only comparing healthy to just one of the onset sites (whichever one is the reference level). So you are using fewer samples than if you were to fix the over-parametrization problem as I suggested, and you are not really making a fair comparison of diseased to healthy overall.
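To see the over-parametrization concretely, here is a small base R sketch with a made-up sample sheet (age and batch omitted, onset labels hypothetical). It only checks the rank of the naive design; the optional last line calls limma's nonEstimable() if limma is installed.

```r
# Hypothetical sample sheet; factor levels are set explicitly so the
# reference onset level does not depend on the locale's sort order.
targets <- data.frame(
  disease_status = factor(
    c("healthy", "healthy", "healthy",
      "diseased", "diseased", "diseased", "diseased"),
    levels = c("healthy", "diseased")
  ),
  onset = factor(
    c("Undefined", "Undefined", "Undefined",
      "onset1", "onset1", "onset2", "onset2"),
    levels = c("onset1", "onset2", "Undefined")
  )
)

naive <- model.matrix(~ 0 + disease_status + onset, data = targets)

# More columns than the rank means at least one coefficient is not estimable.
print(c(columns = ncol(naive), rank = qr(naive)$rank))

# The "Undefined" onset column is identical to the "healthy" column.
print(all(naive[, "onsetUndefined"] == naive[, "disease_statushealthy"]))

# Optional: ask limma which coefficients it considers non-estimable.
if (requireNamespace("limma", quietly = TRUE)) {
  print(limma::nonEstimable(naive))
}
```

In this toy setup onset1 is the reference level, so the diseased coefficient describes onset1 animals only, which illustrates the point about comparing healthy to a single onset site.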


Brilliant, thank you very much for your help! Everything is much clearer now.


I have a very similar question, so I thought I would ask it here in a comment.

Suppose I have three groups: disease1, disease2 and healthy. For disease1 and for disease2, steroid treatment takes two values (0 = without steroid treatment, 1 = with steroid treatment). For healthy samples, the value is NA.

What would the design matrix look like, and how should I define the contrasts, if I would like to find out whether steroid treatment has any influence on the genes differentially expressed between disease1 vs healthy, disease1 vs disease2 and disease2 vs healthy?
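This question was not answered in the thread. One possible way to apply the single-factor approach suggested above is sketched below; the level names (`disease1.s0` for untreated, `disease1.s1` for steroid-treated, and so on) and the contrast names are assumptions made for illustration.

```r
library(limma)

# Hypothetical levels of a combined disease/steroid factor; a design built
# with model.matrix(~ 0 + group) would have one column per level.
group_levels <- c("healthy",
                  "disease1.s0", "disease1.s1",
                  "disease2.s0", "disease2.s1")

contr <- makeContrasts(
  # Disease versus healthy, separately for untreated and treated animals.
  D1vsH.untreated = disease1.s0 - healthy,
  D1vsH.treated   = disease1.s1 - healthy,
  # Steroid effect within each disease; this equals the change in the
  # disease-versus-healthy difference between treated and untreated.
  Steroid.in.D1   = disease1.s1 - disease1.s0,
  Steroid.in.D2   = disease2.s1 - disease2.s0,
  # Does steroid treatment change the disease1-versus-disease2 difference?
  Steroid.on.D1vsD2 = (disease1.s1 - disease2.s1) - (disease1.s0 - disease2.s0),
  levels = group_levels
)
print(contr)
```

Note that healthy cancels out of the steroid contrasts, because healthy animals have no treated and untreated subgroups to compare.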