Hello,
I am using limma to determine differential gene expression between healthy and KO mice.
In my design matrix, I am including several covariates that I know influence gene expression, but that I am not interested in.
Specifically, it looks something like this: 0+disease_status+age+batch+onset.
For disease_status there are only two values (diseased or healthy), and onset describes the site where the first symptoms occurred.
The problem is that for all healthy animals the value for onset is "Undefined", because there is obviously no site of onset in a healthy animal.
This basically means that the "healthy" samples of disease_status are confounded with onset. I think that is the reason why I get the following warning when I run limma:
Coefficients not estimable: ...
Is there a way to adjust the design matrix so this problem does not occur or how should I handle this issue?
Any insights are much appreciated!
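For what it's worth, the confounding described above can be reproduced with a toy sample sheet in base R. The sample sizes and the onset labels (site1, site2) are made up for illustration, and age and batch are left out to keep it short, so this is only a sketch of why one coefficient ends up non-estimable, not the actual data.

```r
# Toy sample sheet (hypothetical values, not the real data):
# 4 healthy animals with onset "Undefined", 6 diseased animals split over two sites
targets <- data.frame(
  disease_status = factor(rep(c("healthy", "diseased"), times = c(4, 6))),
  onset = factor(c(rep("Undefined", 4), rep(c("site1", "site2"), each = 3)))
)

# Additive model as in the question (age and batch omitted for brevity)
design <- model.matrix(~ 0 + disease_status + onset, data = targets)

# If the rank is smaller than the number of columns, at least one
# coefficient cannot be estimated
n_coef <- ncol(design)
design_rank <- qr(design)$rank
print(c(columns = n_coef, rank = design_rank))
```

Here the matrix has one more column than its rank, because "healthy" and "Undefined" pick out exactly the same samples. In limma itself, nonEstimable(design) should list the affected coefficient by name.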
Thank you very much for your quick reply.
Just for my understanding, wouldn't that mean that I decrease my power to detect differentially expressed genes, because I now have fewer samples in the different "diseased" groups?
And what is the consequence if I just continue using the naive model where I do not combine disease_status and onset? Despite the warning, I get many differentially expressed genes that make biological sense.
No, it doesn't. If you want to compare healthy to diseased overall, then you form a contrast like (diseased.onset1 + diseased.onset2)/2 - healthy, which uses all the samples. Combining the two factors into one has no disadvantages at all compared to the model you are fitting now.
The naive model may be ok or wrong depending on which coefficients limma has removed automatically. The removed coefficients are the ones listed as non-estimable (which you didn't show in your question).
In any case, the naive model means that when you test for disease_status you are actually comparing healthy to just one of the onset sites (whichever one is the reference). So you are using fewer samples than if you were to fix the over-parametrization problem as I suggested, and you are not really making a fair comparison of diseased to healthy overall.
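As a minimal sketch of the combined-factor approach and the contrast mentioned above: the sample sheet below is invented, the onset levels are simply called onset1 and onset2 to match the contrast, and the names group, contr and DiseasedVsHealthy are placeholders chosen for this example.

```r
library(limma)

# Toy sample sheet (hypothetical values, not the real data)
targets <- data.frame(
  disease_status = rep(c("healthy", "diseased"), times = c(4, 6)),
  onset = c(rep("Undefined", 4), rep(c("onset1", "onset2"), each = 3))
)

# Combine disease_status and onset into a single factor with three levels
group <- ifelse(targets$disease_status == "healthy",
                "healthy",
                paste("diseased", targets$onset, sep = "."))
group <- factor(group, levels = c("healthy", "diseased.onset1", "diseased.onset2"))

# One column per group; with real data, age and batch would be added
# to the formula as before, e.g. ~ 0 + group + age + batch
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

# The design is now full rank, so nothing is dropped as non-estimable
full_rank <- qr(design)$rank == ncol(design)

# Average of the two diseased groups versus healthy
contr <- makeContrasts(
  DiseasedVsHealthy = (diseased.onset1 + diseased.onset2)/2 - healthy,
  levels = design
)
print(contr)

# With an expression object y (assumed to exist), the usual steps would be:
# fit  <- lmFit(y, design)
# fit2 <- eBayes(contrasts.fit(fit, contr))
# topTable(fit2, coef = "DiseasedVsHealthy")
```

The contrast weights every diseased sample into the comparison, while separate contrasts such as diseased.onset1 - healthy could be added to the same makeContrasts() call if site-specific comparisons are also of interest.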
Brilliant, thank you very much for your help! Everything is much clearer now
I have a very similar question, so I thought of asking it in a comment.
I have three groups: disease1, disease2 and healthy. For disease1 there are two values (0 = without steroid treatment, 1 = with steroid treatment). For disease2 there are the same two values (0 = without steroid treatment, 1 = with steroid treatment). For healthy the values are NA.
What would the design matrix look like, and how would I define the contrasts if I would like to find out whether steroid treatment has any influence on the differentially expressed genes between disease1 vs healthy, disease1 vs disease2, and disease2 vs healthy?
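One possible way to set this up, following the same combined-factor idea discussed earlier in the thread, is sketched below. It is only an illustration under assumptions: the sample sheet is invented (3 samples per group), and the level names (disease1.s0 etc.) and contrast names are placeholders. Which contrasts are appropriate depends on what exactly is meant by "influence of steroid treatment".

```r
library(limma)

# Hypothetical sample sheet: 3 healthy, 6 disease1, 6 disease2;
# steroid is NA for healthy, 0/1 for the diseased samples
targets <- data.frame(
  disease = rep(c("healthy", "disease1", "disease2"), times = c(3, 6, 6)),
  steroid = c(rep(NA, 3), rep(c(0, 1, 0, 1), each = 3))
)

# Single factor with five levels, so the NA values for healthy are never used
group <- ifelse(is.na(targets$steroid),
                targets$disease,
                paste0(targets$disease, ".s", targets$steroid))
group <- factor(group, levels = c("healthy", "disease1.s0", "disease1.s1",
                                  "disease2.s0", "disease2.s1"))

design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

contr <- makeContrasts(
  # Steroid effect within each disease
  Steroid.in.D1 = disease1.s1 - disease1.s0,
  Steroid.in.D2 = disease2.s1 - disease2.s0,
  # Disease vs healthy, separately for untreated and treated samples
  D1vsHealthy.untreated = disease1.s0 - healthy,
  D1vsHealthy.treated = disease1.s1 - healthy,
  D2vsHealthy.untreated = disease2.s0 - healthy,
  D2vsHealthy.treated = disease2.s1 - healthy,
  # Does the steroid effect differ between disease1 and disease2?
  SteroidDiff.D1vsD2 = (disease1.s1 - disease1.s0) - (disease2.s1 - disease2.s0),
  levels = design
)
print(contr)
```

Note that the difference between the treated and untreated "disease1 vs healthy" comparisons reduces to disease1.s1 - disease1.s0, because healthy cancels out. So the change in the disease-vs-healthy comparison due to steroids is the same thing as the steroid effect within that disease.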