Question

Including squared covariates to account for non-linear dependencies in differential gene expression analysis

1

Entering edit mode

nhaus ▴ 70

@789c70a6

Last seen 12 weeks ago

Switzerland

Hello,

I am currently figuring out the best model for my differential gene expression analysis and thought about including squared covariate terms to account for non-linear dependencies.

Specifically I am thinking about the covariates age (of my patients) and RNA integrity, so that the model would look like this: ~0+disease_status+age+age^2+RIN+RIN^2+.... (RIN^2 is an additional column in my data called RIN_2)

I tried to see how many of my genes are actually correlated with RIN^2 using limma.

If I only include RIN^2 in my design, I get many (>1k) genes that show a significant correlation with RIN^2. However, if I include both RIN and RIN^2 I do not detect any genes that are correlated with either RIN or RIN^2. The actual coefficients for the parameters do not really change, but the estimated standard error increases strongly, which is why they are no longer significant I think. I suspect, that this might be due to the strong correlation between RIN or RIN^2.

I would very much appreciate some insights and thoughts, on whether or not you think that it makes sense to include squared covariates when performing differential gene expression analysis.

limma DifferentialExpression • 1.3k views

ADD COMMENT • link 2.8 years ago nhaus ▴ 70

score 1 · Answer 1 · 2022-03-01

If that's how you are specifying the model, do note that it doesn't work. As an example:

> x <- seq(1,10,1)
> x
 [1]  1  2  3  4  5  6  7  8  9 10
> model.matrix(~x)
   (Intercept)  x
1            1  1
2            1  2
3            1  3
4            1  4
5            1  5
6            1  6
7            1  7
8            1  8
9            1  9
10           1 10
attr(,"assign")
[1] 0 1
> model.matrix(~x+x^2)
   (Intercept)  x
1            1  1
2            1  2
3            1  3
4            1  4
5            1  5
6            1  6
7            1  7
8            1  8
9            1  9
10           1 10
attr(,"assign")
[1] 0 1
> model.matrix(~x+I(x^2))
   (Intercept)  x I(x^2)
1            1  1      1
2            1  2      4
3            1  3      9
4            1  4     16
5            1  5     25
6            1  6     36
7            1  7     49
8            1  8     64
9            1  9     81
10           1 10    100
attr(,"assign")
[1] 0 1 2

Also, you shouldn't use a squared term if you don't also have the linear term. As another example, try this

y <- c(seq(1,6,1), 6.6, rep(7, 3))
plot(x,y)
fit <- lm(y~x)
fit2 <- lm(y~x+I(x^2))
fit3 <- lm(x~I(x^2))
lines(predict(fit))
lines(predict(fit2), lty = 2)
lines(predict(fit3), lty = 3, col = "red")

score 1 · Answer 2 · 2022-03-01

1

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 7 minutes ago

WEHI, Melbourne, Australia

If you want to fit quadratic trends in age and RIN, it is better to do it like this:

~0+disease_status+poly(age,2)+poly(RIN,2)

The use of orthogonal polynomials with poly eliminates the correlation between the linear and quadratic terms. See http://www.statsci.org/smyth/pubs/PolyApprox-Preprint.pdf

ADD COMMENT • link 2.8 years ago Gordon Smyth 52k