Hello,
I am currently figuring out the best model for my differential gene expression analysis and thought about including squared covariate terms to account for non-linear dependencies.
Specifically, I am thinking about the covariates age (of my patients) and RNA integrity (RIN), so that the model would look like this:
~0+disease_status+age+age^2+RIN+RIN^2+....
(RIN^2 is an additional column in my data called RIN_2.)
I tried to see how many of my genes are actually correlated with RIN^2 using limma.
If I only include RIN^2 in my design, I get many (>1k) genes that show a significant correlation with RIN^2. However, if I include both RIN and RIN^2, I do not detect any genes that are correlated with either RIN or RIN^2. The actual coefficients for the parameters do not really change, but the estimated standard errors increase strongly, which I think is why they are no longer significant. I suspect that this might be due to the strong correlation between RIN and RIN^2.
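The suspected collinearity can be checked directly. The following is a minimal sketch in base R, assuming simulated RIN values between 5 and 10 (hypothetical, not the data from this post). It compares the correlation between the raw linear and squared terms with two common alternatives, centering before squaring and the orthogonal polynomial columns returned by poly().

```r
# Hypothetical RIN values for illustration only (not the data from the post)
set.seed(1)
RIN <- runif(50, min = 5, max = 10)

# Raw linear and squared terms: over a positive range such as 5-10 these
# are almost perfectly correlated, which inflates both standard errors
cor_raw <- cor(RIN, RIN^2)

# Centering before squaring usually reduces the correlation considerably
RIN_c <- RIN - mean(RIN)
cor_centered <- cor(RIN_c, RIN_c^2)

# poly() returns orthogonal polynomial columns, uncorrelated by construction
P <- poly(RIN, degree = 2)
cor_orthogonal <- cor(P[, 1], P[, 2])

print(c(raw = cor_raw, centered = cor_centered, orthogonal = cor_orthogonal))
```

When two design columns are this strongly correlated, each individual t-test only asks whether that term adds anything once the other is already in the model, so both can come out non-significant even if RIN as a whole matters.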
I would very much appreciate some insights and thoughts on whether it makes sense to include squared covariates when performing differential gene expression analysis.
Thank you for your comment! I have an additional column with RIN^2 called RIN_2 that I am using. I will edit my question to make this clearer. But I didn't know about the I() functionality, which is very helpful!
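For reference, here is a small sketch of what I() changes when building the design matrix. The sample table below is hypothetical; only the column names follow the post. Inside an R formula, ^ is a formula operator (crossing), so age^2 collapses to age, whereas I(age^2) is evaluated arithmetically.

```r
# Hypothetical sample table for illustration; column names follow the post
targets <- data.frame(
  disease_status = factor(rep(c("control", "case"), each = 4)),
  age = c(34, 51, 47, 62, 29, 58, 44, 66),
  RIN = c(7.1, 8.3, 6.5, 9.0, 7.8, 8.9, 6.9, 7.4)
)

# Without I(): age^2 is read as a formula operation and reduces to age,
# so no squared column is created
design_plain <- model.matrix(~ 0 + disease_status + age + age^2,
                             data = targets)

# With I(): the squares are computed and added as separate columns
design_sq <- model.matrix(
  ~ 0 + disease_status + age + I(age^2) + RIN + I(RIN^2),
  data = targets
)

print(colnames(design_plain))
print(colnames(design_sq))
```

A precomputed column such as RIN_2 holds the same values as I(RIN^2), so both approaches lead to the same fit; I() just avoids keeping an extra column in the sample table.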