16 months ago by
Cambridge, United Kingdom
I would probably do something like this:
age <- runif(35, 20, 50) # dummy variable
X <- ns(age, df=5)
design <- model.matrix(~X)
This will fit a spline to the expression of each gene, using the age of each sample as a covariate. The use of a spline is more flexible than just putting
age directly into the design matrix, as it will allow for non-linear trends. Splines with more
df provide more flexible fits at the cost of burning up the residual d.f. available for variance estimation. In general, 3-5 d.f. work well as it's rare to get a trend with lots of peaks/troughs in real data. In your case, you've got plenty of samples and age is the only covariate, so you using 5 d.f. won't do too much damage to your variance estimates.
For testing, you can't interpret the spline coefficients individually. Rather, you'll have to drop them all at once:
# Assume we get a fit object out of lmFit/eBayes.
results <- topTable(fit, coef=2:ncol(design), n=Inf, sort.by="none")
This will test for any response of expression to age, be it an increasing/decreasing/squiggly trend. You'll have to look at the results on a gene-by-gene basis to determine what the trend is, because that's not easily summarized into a single statistic. For purposes of seeing whether it's generally up or down with age, you may consider fitting a simpler model using
age directly as a covariate in
design2 <- model.matrix(~age)
# ... after processing with lmFit/eBayes to get fit2 ...
results2 <- topTable(fit2, coef=2, n=Inf, sort.by="none")
... and reporting the log-fold change from
results2 with the p-values in
results. Positive log-fold changes correspond to a positive gradient with respect to age (i.e., expression goes up with age). I wouldn't use the p-values from
results2, as it assumes linearity.
modified 16 months ago
16 months ago by
Aaron Lun • 14k