Question

DESeq2 time series analysis

1

Entering edit mode

ofonov ▴ 10

@ofonov-10155

Last seen 6.8 years ago

Hi,

I would like to do a time-series analysis in DESeq2. I have following experimental design:
Genotype: RR, SS
Time: 0, 1, 7, 20 days
There are 6 biological replicates for each time point.

I would like to find a list of DEGs between genotypes taking into account time. For that I followed an example at http://www.bioconductor.org/help/workflows/rnaseqGene/#time-course-experiments

I used following to construct a DESeqDataSet object:

dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData = colData,
                             design = ~ Genotype + Time + Genotype:Time)

> colData[sample(nrow(colData), 10), ]
# A tibble: 10 × 4
       ID Genotype   Time Replicate
   <fctr>   <fctr> <fctr>    <fctr>
1      74       SS      7        II
2      51       SS      0       III
3      95       RR      7         V
4      54       SS      0        VI
5      58       RR      0        IV
6      92       RR     20         V
7      73       SS      7         I
8      56       RR      0        II
9      81       SS      7        VI
10     61       RR      1         I

# Relevel
dds$Genotype <- relevel(dds$Genotype, ref="SS")

# Test
dds <- DESeq(dds, test="LRT", reduced = ~ Time + Genotype)
res <- results(dds)
head(res[order(res$padj),])

log2 fold change (MLE): GenotypeRR.Time20
LRT p-value: '~ Genotype + Time + Genotype:Time' vs '~ Time + Genotype'
DataFrame with 6 rows and 6 columns
                       baseMean log2FoldChange     lfcSE      stat      pvalue      padj
                      <numeric>      <numeric> <numeric> <numeric>   <numeric> <numeric>
3p_Ssa_miR_462a     136.3663390   -1.326805416 0.4574671  14.02567 0.002870424 0.7316522
3p_Ssa_miR_731      304.0221843   -0.544891735 0.2804326  11.43828 0.009577344 0.7316522
3p_Ssa_miR_7552a_2    0.7595462   -1.318844886 1.7450825  11.31013 0.010161837 0.7316522
3p_Ssa_miR_9ab     1020.8777037    0.005572198 0.1696003  12.19612 0.006740674 0.7316522
3p_Ssa_miR_9a_3     494.4791700   -0.210968702 0.1636003  14.90190 0.001902433 0.7316522
3p_Ssa_miR_nov_5     22.4156278   -1.561194915 0.5553654  12.19148 0.006755200 0.7316522

I have several questions:

As one can see, the adjusted values are too high. I would like to have a possible explanation for that.
I would like to make sure that I use appropriate design and test formulas for my goals.

Thank you!

DESEQ2 timecourse multiple time points time • 7.0k views

ADD COMMENT • link updated 6.8 years ago by Michael Love 41k • written 6.8 years ago by ofonov ▴ 10

score 3 · Accepted Answer · 2017-06-26

3

Entering edit mode

Michael Love 41k

@mikelove

Last seen 1 day ago

United States

That's the appropriate design and test.

Those adjusted p-values represent the fact that you've tested many genes, and it's possible to see very small p-values under the null hypothesis when you perform thousands of tests. This is a consequence of the multiple test correction methodology.

ADD COMMENT • link 6.8 years ago Michael Love 41k

0

Entering edit mode

Thank you for the rapid reply.

What would be the best option to select top DEGs in case when adjusted p-values are high, or does it mean that this analysis is meaningless? E.g I am interested in following test, and get output with padj=0.9975788:

> res_Genotype_RR_vs_SS<-results(dds, name="Genotype_RR_vs_SS", test="Wald")
> res_Genotype_RR_vs_SS
log2 fold change (MLE): Genotype RR vs SS
Wald test p-value: Genotype RR vs SS
DataFrame with 651 rows and 6 columns
                    baseMean log2FoldChange     lfcSE        stat    pvalue      padj
                   <numeric>      <numeric> <numeric>   <numeric> <numeric> <numeric>
3p_Ssa_let_7a_2    147.10241    0.094900837 0.1566111  0.60596483 0.5445381 0.9975788
3p_Ssa_let_7cd     380.69882    0.008774026 0.1939360  0.04524186 0.9639145 0.9975788
3p_Ssa_let_7c_2     43.82362   -0.042720552 0.2137731 -0.19984060 0.8416052 0.9975788
3p_Ssa_let_7e      138.70703    0.124070509 0.1655608  0.74939528 0.4536190 0.9975788
3p_Ssa_let_7f       39.72065    0.039311862 0.1877974  0.20933126 0.8341897 0.9975788
...                      ...            ...       ...         ...       ...       ...
ssa_mir_3413_3p    78.896131     -0.3343905 0.2664436  -1.2550142 0.2094736 0.9975788
ssa_mir_3413_5p    85.564884     -0.1489334 0.2527720  -0.5892005 0.5557268 0.9975788
ssa_mir_449_b_3p   11.522972      0.2132978 0.3608338   0.5911248 0.5544368 0.9975788
ssa_mir_449_b_5p    5.467536      0.2342977 0.4776624   0.4905090 0.6237738 0.9975788
ssa_mir_451_a_2_3p  6.589232     -0.4717298 0.4739626  -0.9952892 0.3195956 0.9975788

ADD REPLY • link 6.8 years ago ofonov ▴ 10

0

Entering edit mode

I don't think you can say anything really, because the padj column is telling you the fold changes are consistent with no difference across genotype.

ADD REPLY • link 6.8 years ago Michael Love 41k

0

Entering edit mode

Sorry I have a treatment and control over time points with two replication. I want to compare for example 2h in treatment vs 16h in control but whatever I try all comparisons are about 0 vs another time points

> coldata = data.frame(row.names=colnames(mergedd), Platform = as.factor(c(rep("T",18),rep("R",18))),Time=as.factor(c("0","0","2","2","4","4","6","6","8","8","10","10","12","12","14","14","16","16","0","0","2","2","4","4","6","6", "8","8","10","10","12","12","14","14","16","16")))
> Platform= coldata$Platform
> Time=coldata$Time
> time
[1] 0 0 16 16 14 14 12 12 10 10 8 8 6 6 4 4 2 2 0 2 4 6 8 10 12 14 16 0 2 4 6 8 10 12 14 16
Levels: 0 10 12 14 16 2 4 6 8
> class(time)
[1] "factor"
> Platform
[1] T T T T T T T T T T T T T T T T T T R R R R R R R R R R R R R R R R R R
Levels: R T
> class(Platform)
[1] "factor"
> dds <- DESeqDataSetFromMatrix(countData=mergedd, colData= coldata ,design = ~ Time + Platform)
> design(dds) = ~ Time + Platform + Time:Platform
> dds = DESeq(dds, test="Wald")
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
> resultsNames(dds)
[1] "Intercept" "Time_10_vs_0" "Time_12_vs_0" "Time_14_vs_0" "Time_16_vs_0" "Time_2_vs_0"
[7] "Time_4_vs_0" "Time_6_vs_0" "Time_8_vs_0" "Platform_T_vs_R" "Time10.PlatformT" "Time12.PlatformT"
[13] "Time14.PlatformT" "Time16.PlatformT" "Time2.PlatformT" "Time4.PlatformT" "Time6.PlatformT" "Time8.PlatformT"
>

May you please consider why I can't obtained another comparisons?

Thank you for any help

ADD REPLY • link 5.9 years ago AZ ▴ 30

0

Entering edit mode

Can you start a new post? I think it will be easier to answer this way.

ADD REPLY • link 5.9 years ago Michael Love 41k

0

Entering edit mode

Hi Michael,

I have a follow up question. I would like to take one more variable to the model - Family: B or C, so that my colData looks like

colData[sample(nrow(colData), 10), ]
# A tibble: 10 x 5
       ID Family Genotype   Time Replicate
   <fctr> <fctr>   <fctr> <fctr>    <fctr>
 1     53      B       SS      0         5
 2     89      B       SS     20         5
 3     57      B       RR      0         3
 4     42      C       SS     20         3
 5     29      C       RR      7         2
 6     44      C       RR     20         5
 7     64      B       RR      1         3
 8     86      B       SS     20         3
 9      6      C       SS      0         3
10     10      C       RR      0         5

Hypothesis is the same - there are DE genes across time points between different genotypes.

I am not sure what would be the best formula to model the data in this case. I have thought about two options:

design = ~ Genotype + Time + Family + Genotype:Time, for LTR test: reduced = ~ Time + Genotype + Family
design = ~ Genotype*Time*Family, for LTR test reduced = ~ Time + Genotype + Family

Which one would suit the best the purpose? Thank you.

ADD REPLY • link 6.8 years ago ofonov ▴ 10

2

Entering edit mode

I would go with (1), which controls for family while performing the standard time series test.

ADD REPLY • link 6.8 years ago Michael Love 41k

0

Entering edit mode

Hi, I have another question regarding time series analysis. I would like to find out difference across time points, without taking into accout Genotype and Family. In other words, hypothesis is that there is a difference between time points. Searching through the documentation and other questions, I came up with following:

dds <- DESeqDataSetFromMatrix(countData = counts, colData = colData, design = ~ Time + Genotype + Family)

dds <- DESeq(dds, test="LRT", reduced = ~ 1)

resultsNames(dds)

## [1] "Intercept" "Time_1_vs_0" "Time_7_vs_0" ## [4] "Time_20_vs_0" "Genotype_SS_vs_RR" "Family_C_vs_B"

# Test Time_1_vs_0 res_Time_1_vs_0_all<-results(dds, name="Time_1_vs_0", test="Wald") head(res_Time_1_vs_0_all[order(res_Time_1_vs_0_all$padj),])

## log2 fold change (MLE): Time 1 vs 0 ## Wald test p-value: Time 1 vs 0 ## DataFrame with 6 rows and 6 columns ## baseMean log2FoldChange lfcSE stat ## <numeric> <numeric> <numeric> <numeric> ## skotl_ssa_5119_3p 71.15728 -6.6272093 0.49714870 -13.330437 ## 5p_Ssa_miR_19d 39.78204 -1.3800161 0.16127299 -8.557020 ## 5p_Ssa_miR_222d 13.62511 -3.0916371 0.37427083 -8.260428 ## 5p_Ssa_miR_25_3 100.87030 -0.7653306 0.09541316 -8.021227 ## 3p_Ssa_miR_30a_2 81.12808 -0.9421497 0.12947530 -7.276675 ## 3p_Ssa_miR_2184 418.65571 1.9976584 0.29516634 6.767907 ## pvalue padj ## <numeric> <numeric> ## skotl_ssa_5119_3p 1.539932e-40 9.624573e-38 ## 5p_Ssa_miR_19d 1.158227e-17 3.619458e-15 ## 5p_Ssa_miR_222d 1.451516e-16 3.023992e-14 ## 5p_Ssa_miR_25_3 1.046942e-15 1.635846e-13 ## 3p_Ssa_miR_30a_2 3.421471e-13 4.276839e-11 ## 3p_Ssa_miR_2184 1.306586e-11 1.361027e-09

I

1. Is it appropriate way of doing this analysis?

2. Does test "Time_1_vs_0" extract all the values in this case, both genotypes and famileis, so that log fold change represent

adjusted log2(all samples time 1/all samples time 0)?

I have read the ?results, but it is not entirely clear for me in this case.

3. An alternative way of analysis which I though about was not to use LTR, but do following:

dds <- DESeqDataSetFromMatrix(countData = counts,

colData = colData,

design = ~ Time + Genotype + Family)

dds <- DESeq(dds)

results(dds, name="Time_1_vs_0", test="Wald")

Is it correct way of doing this analysis as well?

Thank you!

ADD REPLY • link 6.8 years ago • updated 6.7 years ago ofonov ▴ 10

1

Entering edit mode

The first LRT as you have it is testing if any of the variables in the full design explain counts. So that's not what you want. You should have family and genotype in reduced, if you are interested in focusing on any time differences. And yes, the Wald test as you have it will allow comparisons between two time points for all levels of other variables.

ADD REPLY • link 6.8 years ago Michael Love 41k