I have 74 samples with log normalised gene counts. I want to see their expression pattern changes with age (range 2-72 years). I am not binning the ages and using them as a continuous variable.
This is the script I used to get p-values using splines:
require(splines)
x<-ns(pheno_UMary$Age, df=5)
design<-model.matrix(~x)
fit<-lmFit(counts_UMary, design = design)
fit<- eBayes(fit, robust = TRUE)
results<-topTable(fit coef = 2:ncol(design), n=Inf, sort.by = "none")
head(View(results))
I am a little confused about how to interpret the x1-5 coefficients obtained from the splines.
I then run the second model is a simpler model using age directly as a covariate and use the logFC from that. Since this model assumes linearity, I will consider the adjusted p-values from the first model.
Also, even though we "sort.by = none" for both, the order is different so it is quite difficult to compare the results. And the p-values are extremely different as well.
Thanks for your reply, Aaron.
So what I gather here is that the p-values from the spline model will help us interpret if there is any overall trend at all (without forcing linearity) but the logFC values from the simple model would determine whether that gene is, on the whole, going up/down with age?
Would you recommend binning the age groups into categorical variables?
One approach to simplify interpretation would be to (i) find all DE genes using the spline model, (ii) cluster the genes based on their fitted values, and (iii) annotate the clusters, e.g., early responders, late responders, peaking genes. This will allow you to effectively annotate whole blocks of genes at once, which should be easier than going gene-by-gene.
Hi Aaron,
Thanks again for your reply.
>One approach to simplify interpretation would be to (i) find all DE genes using the spline model, (ii) cluster the genes based on their fitted values, and (iii) annotate the clusters, e.g., early responders, late responders, peaking genes. This will allow you to effectively annotate whole blocks of genes at once, which should be easier than going gene-by-gene.
Here - what do you mean by clustering them according to their fitted values? Do you mean like a k-means clustering of the logged counts of the DE genes?
Also, the absolute values of logFCs from my simple model are all quite low - in the range [-0.04, 0.03] so I am not sure if this is even a worthwhile analysis. Although, I am getting high values of logFC if I use bins, which doesn't seem too promising or reliable.
You can get the matrix of fitted values with:
... and then apply k-means clustering (or whatever algorithm of your choice) to it. Though, I would recommend subsetting the matrix to only use DE genes - this should improve the separation between genes with differences in behaviour. Otherwise you'll probably just get a big clump of genes, as the non-DE genes "fill in" the gaps between clusters.
For a linear model, the size of the log-fold change must always be interpreted relative to the units. For example, a linear model where the covariate is expressed in years will have a quite different log-fold change compared to units in months or days. I would focus on the sign if you're going to use the log-fold change for interpretation.
Thanks again, Aaron. Okay so I was able to cluster the DE genes based on the fitted values and also plotted the logged values of the genes as line graphs against age but do not see any specific trends. Would it be better to plot the fitted values instead?
Also, are there any other ways in which I could annotate these clusters as "early responders, late responders, peaking genes" etc without looking at them gene-by-gene, as you suggested?
I am sorry for so many questions.
It would probably be better to plot the fitted values. Obviously you should be making one plot per cluster, you won't see anything if you throw all genes together. Just take the average of the fitted values per cluster and this will tell you what the behaviour is on average for that cluster.
Yes, I plotted different clusters separately, of course :-)
So I have lost the sample names (colnames) while running the analysis. To be able to plot these graphs, I would need the meta-information that correlates the sample name with their age. Is there a way to get this information back?
I presume the order does not change so I could just replace them with sample names but that seems slightly risky.