Search
Question: Analysis Time course data using DESeq2
0
9 days ago by
zhangfei-1230 wrote:

Hi,

I'm trying to analysis a time-series RNA-seq data using DESeq2, I have 15 time points, 2 conditions, and 2 biological repeat.

I perform LRT test first and the full model is ~batch + condition + time + condition:time, while the reduce model is ~batch + condition + time. All variances are recognized as factor. I choose the threshold fdr < 0.01 and get ~5000 different expressed genes.

Then I want to map these different expressed genes at each time point and perform wald test of these different expressed genes at each time point. I use the threshold fdr < 0.01 and | log2FoldChange | > 1  but only ~2000 genes could be mapped in the time points.  Then I loose the delimitation and use the threshold only fdr < 0.01, there are ~3000 genes could be mappped in specific time points.

I want to make sure whether my method of process time-series data is right. If so, what is the biological meaning of these genes could not mapped in specific time points.

Thanks a lot!!

Fei

modified 9 days ago by Michael Love18k • written 9 days ago by zhangfei-1230
1
9 days ago by
Michael Love18k
United States
Michael Love18k wrote:

The LRT is looking for any differences at any time point and so has higher power compared to each individual time point comparison.

Below is a small example just using R's lm() function.

Notice that, while in fact all of the 10 groups have a different mean, none have a p-value < 0.0004. Meanwhile, the ANOVA has a p-value < 0.0004. So this difference in power means that you can't necessarily find which particular time point is different despite detecting it with the LRT.

> set.seed(2); y <- rnorm(50, rep(rnorm(10,0,1),each=5), 1)
> summary(lm(y ~ x))\$coefficients
Estimate Std. Error    t value     Pr(>|t|)
(Intercept) -0.5470609  0.5540166 -0.9874450 0.3293602263
x2           0.7415973  0.7834978  0.9465212 0.3495659760
x3           3.0229304  0.7834978  3.8582500 0.0004064738
x4          -0.8811522  0.7834978 -1.1246390 0.2674456775
x5           0.6814552  0.7834978  0.8697602 0.3896209633
x6          -0.1264786  0.7834978 -0.1614282 0.8725687700
x7           1.1233450  0.7834978  1.4337564 0.1594131044
x8           0.3496955  0.7834978  0.4463261 0.6577686361
x9           2.7103169  0.7834978  3.4592526 0.0013006822
x10          0.6998463  0.7834978  0.8932332 0.3770750836
> anova(lm(y ~ x))
Analysis of Variance Table

Response: y
Df Sum Sq Mean Sq F value   Pr(>F)
x          9 66.199  7.3555  4.7929 0.000232 ***
Residuals 40 61.387  1.5347
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Hi Michael,

Thanks for your rapid and gentle reply, it helps me a lot.

Could I understand that genes which could be detected by LRT method but not for Wald test at particular time point do have differences at different condition but they are not show different at specific time point so they could be recognized as some kind of 'background' response?

Another question I've meet while dealing with my time course RNASeq data using DESeq2 is the result of DEG is totally different if set time point as a numeric but not factor (LRT method, ~1500 genes in numeric but ~5000 genes in factor) .The case in your tutorial considers time point as factor, but I'm a little confused what DESeq2 did while deal with these two different type of variances?

Many thanks,

Fei

The way to understand how this happens is that there is more statistical power for the LRT.

Changing in R between a factor and numeric in a design formula is a pretty big difference in modeling decision. I'd recommend speaking with a local statistician to understand the different assumptions behind these.