Question: Design matrix for continuous independent variable in limma
1
gravatar for Linda Chaba
3.4 years ago by
Linda Chaba10
Kenya
Linda Chaba10 wrote:

I'm using limma for the first time. I have a problem in creating a design matrix. I'm interested in finding genes that are differentially expressed given only one continuous outcome for example age. I have 35 samples. How will my design matrix looks like?

Thank you. 

microarray limma • 2.5k views
ADD COMMENTlink modified 3.4 years ago • written 3.4 years ago by Linda Chaba10
Answer: Design matrix for continuous independent variable in limma
4
gravatar for Aaron Lun
3.4 years ago by
Aaron Lun23k
Cambridge, United Kingdom
Aaron Lun23k wrote:

I would probably do something like this:

age <- runif(35, 20, 50) # dummy variable
require(splines)
X <- ns(age, df=5)
design <- model.matrix(~X)

This will fit a spline to the expression of each gene, using the age of each sample as a covariate. The use of a spline is more flexible than just putting age directly into the design matrix, as it will allow for non-linear trends. Splines with more df provide more flexible fits at the cost of burning up the residual d.f. available for variance estimation. In general, 3-5 d.f. work well as it's rare to get a trend with lots of peaks/troughs in real data. In your case, you've got plenty of samples and age is the only covariate, so you using 5 d.f. won't do too much damage to your variance estimates.

For testing, you can't interpret the spline coefficients individually. Rather, you'll have to drop them all at once:

# Assume we get a fit object out of lmFit/eBayes.
results <- topTable(fit, coef=2:ncol(design), n=Inf, sort.by="none")

This will test for any response of expression to age, be it an increasing/decreasing/squiggly trend. You'll have to look at the results on a gene-by-gene basis to determine what the trend is, because that's not easily summarized into a single statistic. For purposes of seeing whether it's generally up or down with age, you may consider fitting a simpler model using age directly as a covariate in design:

design2 <- model.matrix(~age)
# ... after processing with lmFit/eBayes to get fit2 ...
results2 <- topTable(fit2, coef=2, n=Inf, sort.by="none")

... and reporting the log-fold change from results2 with the p-values in results. Positive log-fold changes correspond to a positive gradient with respect to age (i.e., expression goes up with age). I wouldn't use the p-values from results2, as it assumes linearity.

ADD COMMENTlink modified 3.4 years ago • written 3.4 years ago by Aaron Lun23k

Thank you Aaron for your answer, it was very helpful. However, I'm getting an error (Error: could not find function "topTags") when using "topTags" to get my results. It could be because I'm missing a package. What difference would it have if I use "topTable" to get my results?  I am using limma package.

Thank you

ADD REPLYlink written 3.4 years ago by Linda Chaba10
2

Aaron got his wires crossed a bit, is all. topTags is the analogous function to topTable in edgeR. You should go forward with using topTable

ADD REPLYlink written 3.4 years ago by Steve Lianoglou12k

Thank you Steve. 

ADD REPLYlink written 3.4 years ago by Linda Chaba10

Thanks Steve, fixed above.

ADD REPLYlink written 3.4 years ago by Aaron Lun23k

Most people like to see which genes are more significant, so allow topTable() to sort the results for you:

topTable(fit, n=20)

say.

ADD REPLYlink written 3.4 years ago by Gordon Smyth37k

The reason why I set sort.by="none" was to allow for easier cross-referencing between results and results2, if the gradient of the age response was of interest. Otherwise, we'd have to use match or merge.

ADD REPLYlink written 3.4 years ago by Aaron Lun23k

Dear Aaron,

One more question.

I don't have raw data set.I have  processed (normalized and filtered) data set consisting of log2 rations of 23,360 genes on 35 samples in an excel format. I imported the data the usual way and used it to get a fit object out of lmFit/eBayes. My question is, is it okay to use processed data since I see most examples using raw data set. Secondly, with the results2 you mentioned, can I use adjusted p-values to get differentially expressed genes?  I'm adjusting using fdr approach.

Thank you

ADD REPLYlink written 3.4 years ago by Linda Chaba10
1
  1. This depends on how the data is normalized and filtered, but generally speaking, yes. lmFit expects log-transformed expression values after normalization and filtering - see Chapters 6 and 7 of the limma user's guide.
  2. Use the adjusted p-values from results, not results2. The only purpose of results2 is to give you a single log-fold change statistic that you can interpret as the gradient with respect to age. The fit with the splines will provide more appropriate (adjusted) p-values in results as it doesn't assume linearity in expression with respect to age.
ADD REPLYlink modified 3.4 years ago • written 3.4 years ago by Aaron Lun23k

Okay. Thanks

ADD REPLYlink written 3.4 years ago by Linda Chaba10

Hi Aaron, you said that using the spline approach we can look at the results on a "gene-by-gene" basis to interpret the results. How would that be? I am unable to interpret the spline coefficients ....

I have run a very similar analysis:
This is the script I used to get p-values using splines:

require(splines)
x<-ns(pheno_UMary$Age, df=5)
design<-model.matrix(~x)
fit<-lmFit(counts_UMary, design = design)
fit<- eBayes(fit, robust = TRUE)
results<-topTable(fit coef = 2:ncol(design), n=Inf, sort.by = "none", p.value = 0.05)
head(View(results))

I then run the second model is a simpler model using age directly as a covariate.

I am a little confused about how to interpret the x1-5 coefficients obtained from the splines. Also, even though we "sort.by = none" for both, the order is different.
 

ADD REPLYlink modified 10 months ago • written 10 months ago by anupriya.dalmia0

Please create a new post for a new question.

ADD REPLYlink written 10 months ago by Aaron Lun23k

Done, sorry.

ADD REPLYlink written 10 months ago by anupriya.dalmia0

Hello Aaron,

does this work also in the more "extreme" case when age is constant, like for example age <- rep(1,35) instead of age <- runif(35, 20, 50) Housekeeping/reference genes are expected to be more or less constantly expressed.

ADD REPLYlink written 11 weeks ago by Daniel10

I would prefer if you asked a new question, otherwise these threads end up being too long.

The answer is no, it doesn't work because your "extreme" case is broken. If all the ages are constant, how can one hope to determine the effect of age? You will find that your design matrix is not of full rank, assuming that splines::ns doesn't spit the dummy beforehand.

Whether a gene is housekeeping or not is largely irrelevant to the construction of the design matrix.

ADD REPLYlink written 11 weeks ago by Aaron Lun23k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 140 users visited in the last hour