Question: Design matrix for regressing gene expression on a continuous variable - age
0
11 months ago by
anupriya.dalmia0 wrote:

I have 74 samples with log normalised gene counts. I want to see their expression pattern changes with age (range 2-72 years). I am not binning the ages and using them as a continuous variable.

This is the script I used to get p-values using splines:

require(splines)
x<-ns(pheno_UMary$Age, df=5) design<-model.matrix(~x) fit<-lmFit(counts_UMary, design = design) fit<- eBayes(fit, robust = TRUE) results<-topTable(fit coef = 2:ncol(design), n=Inf, sort.by = "none") head(View(results)) I am a little confused about how to interpret the x1-5 coefficients obtained from the splines. I then run the second model is a simpler model using age directly as a covariate and use the logFC from that. Since this model assumes linearity, I will consider the adjusted p-values from the first model. Also, even though we "sort.by = none" for both, the order is different so it is quite difficult to compare the results. And the p-values are extremely different as well. limma R lmfit design matrix • 294 views ADD COMMENTlink modified 11 months ago by Aaron Lun24k • written 11 months ago by anupriya.dalmia0 Answer: Design matrix for regressing gene expression on a continuous variable - age 0 11 months ago by Aaron Lun24k Cambridge, United Kingdom Aaron Lun24k wrote: The estimated values of spline coefficients are extremely difficult to interpret - don't bother trying. The only useful thing to get out of the spline model is the p-value, which tests the null hypothesis that all spline coefficients are equal to zero, i.e., there is no trend of any kind. I usually combine the p-value from the spline model with the log-fold change from a model with a linear fit. The former allows for non-linear trends during the significance calculations, while the latter provides an interpretable summary of the change (i.e., does expression tend to go up or down?). However, the only general way to characterize the trend is to plot the fitted values against time for each gene. I mean, how would a single statistic distinguish between (i) a trend that goes linearly, (ii) a trend that goes up early and plateaus, or (iii) a trend that is constant early and then goes up late? And that's not even considering trends that change direction over time. Obviously, if you have non-linear trends, the p-values will be different. In the simplest case, a gene that goes up and then down will be fitted to a line with a gradient of zero (thus non-significant), but will exhibit a significant trend with the spline. For biological systems, I would trust the p-values from the spline model much more, as the assumption of linearity is dubious. ADD COMMENTlink modified 11 months ago • written 11 months ago by Aaron Lun24k Thanks for your reply, Aaron. So what I gather here is that the p-values from the spline model will help us interpret if there is any overall trend at all (without forcing linearity) but the logFC values from the simple model would determine whether that gene is, on the whole, going up/down with age? Would you recommend binning the age groups into categorical variables? ADD REPLYlink written 11 months ago by anupriya.dalmia0 1. Yes. Though obviously, if a gene steadily goes up with age and then suddenly drops down at large values, it will probably still have a positive log-fold change. Remember that the log-fold change will only tell you the direction on average. 2. There is an argument for a small number of bins for interpretation purposes only, e.g., to report the log-fold change of each bin from the first. This would provide a bit more information than the overall log-fold change, as you could compare different responses across different bins. However, don't use too many bins, as then you will eventually run into the same difficulties of interpretation. Also, the p-value should still be computed using the spline model; there are statistical issues with correlations within each bin if the bins are too large, and loss of residual d.f. if the bins are too small. One approach to simplify interpretation would be to (i) find all DE genes using the spline model, (ii) cluster the genes based on their fitted values, and (iii) annotate the clusters, e.g., early responders, late responders, peaking genes. This will allow you to effectively annotate whole blocks of genes at once, which should be easier than going gene-by-gene. ADD REPLYlink modified 11 months ago • written 11 months ago by Aaron Lun24k Hi Aaron, Thanks again for your reply. >One approach to simplify interpretation would be to (i) find all DE genes using the spline model, (ii) cluster the genes based on their fitted values, and (iii) annotate the clusters, e.g., early responders, late responders, peaking genes. This will allow you to effectively annotate whole blocks of genes at once, which should be easier than going gene-by-gene. Here - what do you mean by clustering them according to their fitted values? Do you mean like a k-means clustering of the logged counts of the DE genes? Also, the absolute values of logFCs from my simple model are all quite low - in the range [-0.04, 0.03] so I am not sure if this is even a worthwhile analysis. Although, I am getting high values of logFC if I use bins, which doesn't seem too promising or reliable. ADD REPLYlink modified 11 months ago • written 11 months ago by anupriya.dalmia0 You can get the matrix of fitted values with: fit$coefficients %*% t(fit\$design)

... and then apply k-means clustering (or whatever algorithm of your choice) to it. Though, I would recommend subsetting the matrix to only use DE genes - this should improve the separation between genes with differences in behaviour. Otherwise you'll probably just get a big clump of genes, as the non-DE genes "fill in" the gaps between clusters.

For a linear model, the size of the log-fold change must always be interpreted relative to the units. For example, a linear model where the covariate is expressed in years will have a quite different log-fold change compared to units in months or days. I would focus on the sign if you're going to use the log-fold change for interpretation.

Thanks again, Aaron. Okay so I was able to cluster the DE genes based on the fitted values and also plotted the logged values of the genes as line graphs against age but do not see any specific trends. Would it be better to plot the fitted values instead?

Also, are there any other ways in which I could annotate these clusters as "early responders, late responders, peaking genes" etc without looking at them gene-by-gene, as you suggested?

I am sorry for so many questions.

It would probably be better to plot the fitted values. Obviously you should be making one plot per cluster, you won't see anything if you throw all genes together. Just take the average of the fitted values per cluster and this will tell you what the behaviour is on average for that cluster.

Yes, I plotted different clusters separately, of course :-)

So I have lost the sample names (colnames) while running the analysis. To be able to plot these graphs, I would need the meta-information that correlates the sample name with their age. Is there a way to get this information back?

I presume the order does not change so I could just replace them with sample names but that seems slightly risky.