Question

Design matrix for regressing gene expression on a continuous variable - age

0

Entering edit mode

anupriya.dalmia • 0

@anupriyadalmia-15435

Last seen 5.8 years ago

I have 74 samples with log normalised gene counts. I want to see their expression pattern changes with age (range 2-72 years). I am not binning the ages and using them as a continuous variable.

This is the script I used to get p-values using splines:

require(splines)
x<-ns(pheno_UMary$Age, df=5)
design<-model.matrix(~x)
fit<-lmFit(counts_UMary, design = design)
fit<- eBayes(fit, robust = TRUE)
results<-topTable(fit coef = 2:ncol(design), n=Inf, sort.by = "none")
head(View(results))

I am a little confused about how to interpret the x1-5 coefficients obtained from the splines.

I then run the second model is a simpler model using age directly as a covariate and use the logFC from that. Since this model assumes linearity, I will consider the adjusted p-values from the first model.

Also, even though we "sort.by = none" for both, the order is different so it is quite difficult to compare the results. And the p-values are extremely different as well.

limma lmfit design matrix r • 1.6k views

ADD COMMENT • link updated 5.7 years ago by Aaron Lun ★ 28k • written 5.7 years ago by anupriya.dalmia • 0

score 1 · Answer 1 · 2018-07-20

1

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 6 hours ago

The city by the bay

The estimated values of spline coefficients are extremely difficult to interpret - don't bother trying. The only useful thing to get out of the spline model is the p-value, which tests the null hypothesis that all spline coefficients are equal to zero, i.e., there is no trend of any kind.

I usually combine the p-value from the spline model with the log-fold change from a model with a linear fit. The former allows for non-linear trends during the significance calculations, while the latter provides an interpretable summary of the change (i.e., does expression tend to go up or down?). However, the only general way to characterize the trend is to plot the fitted values against time for each gene. I mean, how would a single statistic distinguish between (i) a trend that goes linearly, (ii) a trend that goes up early and plateaus, or (iii) a trend that is constant early and then goes up late? And that's not even considering trends that change direction over time.

Obviously, if you have non-linear trends, the p-values will be different. In the simplest case, a gene that goes up and then down will be fitted to a line with a gradient of zero (thus non-significant), but will exhibit a significant trend with the spline. For biological systems, I would trust the p-values from the spline model much more, as the assumption of linearity is dubious.

ADD COMMENT • link 5.7 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thanks for your reply, Aaron.

So what I gather here is that the p-values from the spline model will help us interpret if there is any overall trend at all (without forcing linearity) but the logFC values from the simple model would determine whether that gene is, on the whole, going up/down with age?

Would you recommend binning the age groups into categorical variables?

ADD REPLY • link 5.7 years ago anupriya.dalmia • 0

0

Entering edit mode

Yes. Though obviously, if a gene steadily goes up with age and then suddenly drops down at large values, it will probably still have a positive log-fold change. Remember that the log-fold change will only tell you the direction on average.
There is an argument for a small number of bins for interpretation purposes only, e.g., to report the log-fold change of each bin from the first. This would provide a bit more information than the overall log-fold change, as you could compare different responses across different bins. However, don't use too many bins, as then you will eventually run into the same difficulties of interpretation. Also, the p-value should still be computed using the spline model; there are statistical issues with correlations within each bin if the bins are too large, and loss of residual d.f. if the bins are too small.

One approach to simplify interpretation would be to (i) find all DE genes using the spline model, (ii) cluster the genes based on their fitted values, and (iii) annotate the clusters, e.g., early responders, late responders, peaking genes. This will allow you to effectively annotate whole blocks of genes at once, which should be easier than going gene-by-gene.

ADD REPLY • link 5.7 years ago Aaron Lun ★ 28k

0

Entering edit mode

Hi Aaron,

Thanks again for your reply.
>One approach to simplify interpretation would be to (i) find all DE genes using the spline model, (ii) cluster the genes based on their fitted values, and (iii) annotate the clusters, e.g., early responders, late responders, peaking genes. This will allow you to effectively annotate whole blocks of genes at once, which should be easier than going gene-by-gene.

Here - what do you mean by clustering them according to their fitted values? Do you mean like a k-means clustering of the logged counts of the DE genes?

Also, the absolute values of logFCs from my simple model are all quite low - in the range [-0.04, 0.03] so I am not sure if this is even a worthwhile analysis. Although, I am getting high values of logFC if I use bins, which doesn't seem too promising or reliable.

ADD REPLY • link 5.7 years ago anupriya.dalmia • 0

0

Entering edit mode

You can get the matrix of fitted values with:

fit$coefficients %*% t(fit$design)

... and then apply k-means clustering (or whatever algorithm of your choice) to it. Though, I would recommend subsetting the matrix to only use DE genes - this should improve the separation between genes with differences in behaviour. Otherwise you'll probably just get a big clump of genes, as the non-DE genes "fill in" the gaps between clusters.

For a linear model, the size of the log-fold change must always be interpreted relative to the units. For example, a linear model where the covariate is expressed in years will have a quite different log-fold change compared to units in months or days. I would focus on the sign if you're going to use the log-fold change for interpretation.

ADD REPLY • link 5.7 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thanks again, Aaron. Okay so I was able to cluster the DE genes based on the fitted values and also plotted the logged values of the genes as line graphs against age but do not see any specific trends. Would it be better to plot the fitted values instead?

Also, are there any other ways in which I could annotate these clusters as "early responders, late responders, peaking genes" etc without looking at them gene-by-gene, as you suggested?

I am sorry for so many questions.

ADD REPLY • link 5.7 years ago anupriya.dalmia • 0

0

Entering edit mode

It would probably be better to plot the fitted values. Obviously you should be making one plot per cluster, you won't see anything if you throw all genes together. Just take the average of the fitted values per cluster and this will tell you what the behaviour is on average for that cluster.

ADD REPLY • link 5.7 years ago Aaron Lun ★ 28k

0

Entering edit mode

Yes, I plotted different clusters separately, of course :-)

So I have lost the sample names (colnames) while running the analysis. To be able to plot these graphs, I would need the meta-information that correlates the sample name with their age. Is there a way to get this information back?

I presume the order does not change so I could just replace them with sample names but that seems slightly risky.

ADD REPLY • link 5.7 years ago anupriya.dalmia • 0