Timely differentially expressed genes
1
0
Entering edit mode
triZZla • 0
@ivozeller-12603
Last seen 9 months ago
Germany

Dear community,

I am analyzing microarray data of patients followed from birth up to an age of 3 years. The samples were taken at various age points for each child.

I want to identify non constant gene expression over time within these children, so genes that are differentially expressed over time.

For this I have applied a mixed effects model with splines(df=3) in limma adjusting for sex as a covariate.

X <- ns(age, df=3) 

The topTable function results in various differentially expressed genes, I have set the coef parameter to coef=1:3, including the columns X1,X2 X3.

My question is how are the log2foldchanges in the result table of topTable() interpreted?

Many thanks in advance

timecoursedata limma • 1.7k views
ADD COMMENT
0
Entering edit mode

Are you sure that coef=1:3 is correct? Usually coef=1 is the intercept.

ADD REPLY
0
Entering edit mode

Thanks for your help! Actually I am not so sure, as from my background I am not particular strong in mathematics. I have fitted the model without an intercept and used the columns of the output matrix of the ns() function as coef=1:3

ADD REPLY
0
Entering edit mode

No, that is wrong. The ns() function assumes that an intercept is included in the model, as is the default in R. Why would you remove it?

ADD REPLY
0
Entering edit mode

The default in the ns() function is intercept = FALSE, but the lm function adds than an intercept, right ? I played a bit with the example that was given by James MacDonald to get a better understanding and feeling for the analysis. Would the following approach be correct ?

X <- ns(age, df = 3)
design <- model.matrix( ~  X + gender , metadata )
fit <- lmFit(expressionMatrix, design=design,
         block=patient_id, correlation=cor)
fitE <- eBayes(fit)
topTable(fitE, n=Inf, coef = 2:4, sort.by="B",
               adjust.method ="BH", p.value = 0.05, lfc = 0.14)

In this case what would the lfc threshold mean ? It is not really interpretable as far as I have understood from the answer below because it refers to coefficients in the spline model and not the difference between two conditions ?

ADD REPLY
0
Entering edit mode

Would it be reasonable to take the maximum log2 foldchange within each predicted trajectory as a measure of how strong a gene is differentially expressed and also as a measure to cut off genes from the list that show very little variation in gene expression over time ?

ADD REPLY
1
Entering edit mode

You mean maximum logFC between any pair of times? Yes, you could do that. I do not generally recommend ad hoc cutoffs like that however. Usually it is better to simply rank the genes by p-value. The empirical Bayes prcoedure of limma already gives you a lot of protection against identifying small fold changes as DE so adding an ad hoc fold-change cutoff is usually unnecessary.

ADD REPLY
0
Entering edit mode

Thanks. If I consider all significant genes without any maximum logFC criteria, the gene with the lowest expression change over time, has about 4 % percent maximum change over time ~0.06 logFC. Can this still be physiological relevant ?

ADD REPLY
0
Entering edit mode

I have never seen such a small logFC coming up as significant in any of my own analyses. You must have a large number of samples and, if you do, then applying a max logFC cutoff could be reasonable. Unfortunately we don't have a TREAT method for spline trends.

ADD REPLY
0
Entering edit mode

Thanks, but what is the lowest that you have seen in such circumstances ? And did you have systematically calculated it actually or just read off for some genes from the predicted curves? So far I have only worked with treatment experiments, and I know in these cases the foldchanges are often enormous, but I am lacking any feeling for genes that are just observed over age without any treatment. Regarding the change I don't know if a 4 % change can have physiological effects. If I think of a guy who is 180 cm and a guy that is 187.2 cm, the difference is quite obvious, I don't know how much this translates to the world of transcriptomics. The biggest lFC in my data is around 1, even this I consider quite low from what I have observed in previous treatment experiments

ADD REPLY
1
Entering edit mode
@james-w-macdonald-5106
Last seen 2 hours ago
United States

The short answer is that the logFC isn't interpreted. You have found genes that change over time in a rather unspecified way. Usually what I do next is to use kmeans clustering to then identify groups of genes that react similarly over time. You can then do something like a spaghetti plot to show the overall pattern for each group identified and interpret based on those plots. Here is an unrelated example of a spaghetti plot, using kmeans to identify genes that vary due to genotype.

enter image description here

ADD COMMENT
0
Entering edit mode

Took a bit to get the image posted, as imgur no longer provides direct links...

ADD REPLY
0
Entering edit mode

Thanks for your answer! I have done something very similar with the degPatterns() function. But I would be interessted to assess from the topTable output, if a particular gene is increasing more over time compared to another gene without plotting the particular genes. The genes that fall into one cluster according to degPatterns() have very similar logFC value patterns for X1, X2 und X3, but I am not sure how to interpret these. I have defined a gene as differentially expressed when padj <=0.05 and the absolute logFC>=0.14 in at least one column.

ADD REPLY
1
Entering edit mode

You don't interpret the logFC for a spline fit, so you cannot use the output from topTable to assess anything. As an example, try doing this:

> library(splines)
> example("ns")
> summary(fm1)

Call:
lm(formula = weight ~ ns(height, df = 5), data = women)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.38333 -0.12585  0.07083  0.15401  0.30426 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)         114.7447     0.2338  490.88  < 2e-16 ***
ns(height, df = 5)1  15.9474     0.3699   43.12 9.69e-12 ***
ns(height, df = 5)2  25.1695     0.4323   58.23 6.55e-13 ***
ns(height, df = 5)3  33.2582     0.3541   93.93 8.91e-15 ***
ns(height, df = 5)4  50.7894     0.6062   83.78 2.49e-14 ***
ns(height, df = 5)5  45.0363     0.2784  161.75  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2645 on 9 degrees of freedom
Multiple R-squared:  0.9998,    Adjusted R-squared:  0.9997 
F-statistic:  9609 on 5 and 9 DF,  p-value: < 2.2e-16

We have an intercept and estimates for each of the five knots in the spline. And the plot that is also produced by the example code shows a line with a slight upward curve. Would you be able to look at the coefficients for this model and say anything about the underlying relationship between height and weight, in particular that there is a curve and in which direction it curves? I can't. But I can sure tell you that the spline fit is better.

> anova(fm1, update(fm1, .~ 1 + height))
Analysis of Variance Table

Model 1: weight ~ ns(height, df = 5)
Model 2: weight ~ height
  Res.Df     RSS Df Sum of Sq      F   Pr(>F)    
1      9  0.6298                                 
2     13 30.2333 -4   -29.604 105.76 1.47e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

And I can also provide a plot with the spline fit and the linear fit and provide the above ANOVA table to show that the former fits better than the latter. But the interpretation comes only from the plot, rather than the output.

ADD REPLY

Login before adding your answer.

Traffic: 577 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6