Hello,
I have rnaseq data generated from brain tissues of 100 individuals, half of them were affected by a disease. All the samples were mix of two different brain regions, obtained from 4 different providers. I am trying to run DESeq2 to identify
differentially expressed genes using the model like: ~age + sex + tissue + PMI + condition, where PMI is post-mortem interval. My colData looks like this:
condition age sex tissue PMI RIN brain_bank
A016-93 disease 79 M FCBA89 48.00 6.400000 LND
A032-95 disease 71 F FCBA89 65.00 6.550000 LND
A046-94 disease 70 F FCBA89 100.00 5.400000 LND
AN00216 disease 63 M FCBA89 26.16 6.150000 KBC
AN02738 disease 85 F FCBA89 12.80 6.200000 KBC
SD001-07 disease 37 F IF 46.00 6.200000 MII
.
.
.
AN03019 control 47 F FCBA89 31.85 5.300000 KBC
SD004-10 control 50 M IF 63.00 5.400000 MII
A046-91 control 41 M IF 49.00 5.450000 LND
Here condition, sex, tissue, brain_bank are as factors and age, RIN, PMI are numeric variables. The RIN values are averaged values because all the samples were sequenced in duplicates.
I want to include brain_bank as a covariate in the model but when I this: ~age + sex + tissue + brain_bank + PMI + condition, it gives me the following error:
the design formula contains a numeric variable with integer values,
specifying a model with increasing fold change for higher values.
did you mean for this to be a factor? if so, first convert
this variable to a factor using the factor() function
Show Traceback
Rerun with Debug
Error in checkFullRank(modelMatrix) :
the model matrix is not full rank, so the model cannot be fit as specified.
One or more variables or interaction terms in the design formula are linear
combinations of the others and must be removed.
I have seen this error several times but being novice in statistics I am never able to understand this error. If anyone can suggest a good reading material about it, would be great.
Anyways following are my questions:
1) Is there some way to incorporate brain_bank as a covariate in the analysis or I should simply go for the first model?
2) Should I use mean RIN values in the model to account for the effect? Any other suggestions?
3) Not sure if this should asked as a separate question. I also want to run DEXseq in order to find diff. exon usage between cases and controls. So can I use the following model:
~sample + exon + exon:condition + age:exon + sex:exon + tissue:exon + PMI:exon ## Full formula
~sample + exon + age:exon + sex:exon + tissue:exon + PMI:exon ## Reduced formula
Thanks in advance for helping.
"The model matrix is not full rank" means that brain_bank is not linearly independent from your other variables, so there is no way for the fitting to estimate the brain_bank effect if you also want to include all your other variables - it can be tricky to see why exactly this is, but from the excerpt of the data you provide, it looks like MII is only providing 'IF' tissues, so if this holds for the rest of the colData, then there is no way of knowing if the correlation is due to that bank or to the tissue. You'll have to omit the brain_bank variable: this means you're bundling up the brain_bank variability into the noise, but you may be susceptible to the error of claiming genes being differential due to tissue type, when they are actually partially due to systematic differences between MII and other brain_banks (unlikely, in my experience - you have multiple banks contributing IF samples, so the main concern is the slight increase in residual noise due to not correcting for bank, but that could well be negligible).
Similarly, inclusion of other terms such as RIN is part of the craft of model-building. Excluding it means that you might miss biologically relevant genes whose expression have a linear dependency on RIN (due to the increased residual noise); including the term means you might miss a gene with a spurious correlation with RIN but a real correlation with biology. It's hard to give generic advice, but if there's strong correlation between technical variables such as RIN and biological variables, then you need to think much harder about what's going on. If not, then inclusion/exclusion of the technical variables should make little practical difference (but if it does, then again, you'll need to think hard, and perhaps examine the behaviour of a few key genes that depend heavily on the choice of model).