Including squared covariates to account for non-linear dependencies in differential gene expression analysis
2
1
Entering edit mode
nhaus ▴ 70
@789c70a6
Last seen 12 weeks ago
Switzerland

Hello,

I am currently figuring out the best model for my differential gene expression analysis and thought about including squared covariate terms to account for non-linear dependencies.

Specifically I am thinking about the covariates age (of my patients) and RNA integrity, so that the model would look like this: ~0+disease_status+age+age^2+RIN+RIN^2+.... (RIN^2 is an additional column in my data called RIN_2)

I tried to see how many of my genes are actually correlated with RIN^2 using limma.

If I only include RIN^2 in my design, I get many (>1k) genes that show a significant correlation with RIN^2. However, if I include both RIN and RIN^2 I do not detect any genes that are correlated with either RIN or RIN^2. The actual coefficients for the parameters do not really change, but the estimated standard error increases strongly, which is why they are no longer significant I think. I suspect, that this might be due to the strong correlation between RIN or RIN^2.

I would very much appreciate some insights and thoughts, on whether or not you think that it makes sense to include squared covariates when performing differential gene expression analysis.

limma DifferentialExpression • 1.3k views
ADD COMMENT
1
Entering edit mode
@james-w-macdonald-5106
Last seen 12 hours ago
United States

If that's how you are specifying the model, do note that it doesn't work. As an example:

> x <- seq(1,10,1)
> x
 [1]  1  2  3  4  5  6  7  8  9 10
> model.matrix(~x)
   (Intercept)  x
1            1  1
2            1  2
3            1  3
4            1  4
5            1  5
6            1  6
7            1  7
8            1  8
9            1  9
10           1 10
attr(,"assign")
[1] 0 1
> model.matrix(~x+x^2)
   (Intercept)  x
1            1  1
2            1  2
3            1  3
4            1  4
5            1  5
6            1  6
7            1  7
8            1  8
9            1  9
10           1 10
attr(,"assign")
[1] 0 1
> model.matrix(~x+I(x^2))
   (Intercept)  x I(x^2)
1            1  1      1
2            1  2      4
3            1  3      9
4            1  4     16
5            1  5     25
6            1  6     36
7            1  7     49
8            1  8     64
9            1  9     81
10           1 10    100
attr(,"assign")
[1] 0 1 2

Also, you shouldn't use a squared term if you don't also have the linear term. As another example, try this

y <- c(seq(1,6,1), 6.6, rep(7, 3))
plot(x,y)
fit <- lm(y~x)
fit2 <- lm(y~x+I(x^2))
fit3 <- lm(x~I(x^2))
lines(predict(fit))
lines(predict(fit2), lty = 2)
lines(predict(fit3), lty = 3, col = "red")
ADD COMMENT
0
Entering edit mode

Thank you for your comment! I have an additional column with RIN^2 called RIN_2 that I am using. I will edit my question to make this more clear. But I didnt know about the I functionality, which is very helpful!

ADD REPLY
1
Entering edit mode
@gordon-smyth
Last seen 7 minutes ago
WEHI, Melbourne, Australia

If you want to fit quadratic trends in age and RIN, it is better to do it like this:

~0+disease_status+poly(age,2)+poly(RIN,2)

The use of orthogonal polynomials with poly eliminates the correlation between the linear and quadratic terms. See http://www.statsci.org/smyth/pubs/PolyApprox-Preprint.pdf

ADD COMMENT

Login before adding your answer.

Traffic: 944 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6