Including squared covariates to account for non-linear dependencies in differential gene expression analysis
2
0
Entering edit mode
nhaus • 0
@789c70a6
Last seen 10 days ago
Switzerland

Hello,

I am currently figuring out the best model for my differential gene expression analysis and thought about including squared covariate terms to account for non-linear dependencies.

Specifically I am thinking about the covariates age (of my patients) and RNA integrity, so that the model would look like this: ~0+disease_status+age+age^2+RIN+RIN^2+.... (RIN^2 is an additional column in my data called RIN_2)

I tried to see how many of my genes are actually correlated with RIN^2 using limma.

If I only include RIN^2 in my design, I get many (>1k) genes that show a significant correlation with RIN^2. However, if I include both RIN and RIN^2 I do not detect any genes that are correlated with either RIN or RIN^2. The actual coefficients for the parameters do not really change, but the estimated standard error increases strongly, which is why they are no longer significant I think. I suspect, that this might be due to the strong correlation between RIN or RIN^2.

I would very much appreciate some insights and thoughts, on whether or not you think that it makes sense to include squared covariates when performing differential gene expression analysis.

limma DifferentialExpression • 248 views
1
Entering edit mode
@james-w-macdonald-5106
Last seen 2 days ago
United States

If that's how you are specifying the model, do note that it doesn't work. As an example:

> x <- seq(1,10,1)
> x
[1]  1  2  3  4  5  6  7  8  9 10
> model.matrix(~x)
(Intercept)  x
1            1  1
2            1  2
3            1  3
4            1  4
5            1  5
6            1  6
7            1  7
8            1  8
9            1  9
10           1 10
attr(,"assign")
[1] 0 1
> model.matrix(~x+x^2)
(Intercept)  x
1            1  1
2            1  2
3            1  3
4            1  4
5            1  5
6            1  6
7            1  7
8            1  8
9            1  9
10           1 10
attr(,"assign")
[1] 0 1
> model.matrix(~x+I(x^2))
(Intercept)  x I(x^2)
1            1  1      1
2            1  2      4
3            1  3      9
4            1  4     16
5            1  5     25
6            1  6     36
7            1  7     49
8            1  8     64
9            1  9     81
10           1 10    100
attr(,"assign")
[1] 0 1 2


Also, you shouldn't use a squared term if you don't also have the linear term. As another example, try this

y <- c(seq(1,6,1), 6.6, rep(7, 3))
plot(x,y)
fit <- lm(y~x)
fit2 <- lm(y~x+I(x^2))
fit3 <- lm(x~I(x^2))
lines(predict(fit))
lines(predict(fit2), lty = 2)
lines(predict(fit3), lty = 3, col = "red")

0
Entering edit mode

Thank you for your comment! I have an additional column with RIN^2 called RIN_2 that I am using. I will edit my question to make this more clear. But I didnt know about the I functionality, which is very helpful!

1
Entering edit mode
@gordon-smyth
Last seen 1 hour ago
WEHI, Melbourne, Australia

If you want to fit quadratic trends in age and RIN, it is better to do it like this:

~0+disease_status+poly(age,2)+poly(RIN,2)


The use of orthogonal polynomials with poly eliminates the correlation between the linear and quadratic terms. See http://www.statsci.org/smyth/pubs/PolyApprox-Preprint.pdf