Search
Question: DESeq2 time series analysis
0
gravatar for ofonov
5 months ago by
ofonov0
ofonov0 wrote:

Hi,


I would like to do a time-series analysis in DESeq2. I have following experimental design:
Genotype: RR, SS
Time: 0, 1, 7, 20 days
There are 6 biological replicates for each time point.

I would like to find a list of DEGs between genotypes taking into account time. For that I followed an example at http://www.bioconductor.org/help/workflows/rnaseqGene/#time-course-experiments

I used following  to construct a DESeqDataSet object:

dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData = colData,
                             design = ~ Genotype + Time + Genotype:Time)
> colData[sample(nrow(colData), 10), ]
# A tibble: 10 × 4
       ID Genotype   Time Replicate
   <fctr>   <fctr> <fctr>    <fctr>
1      74       SS      7        II
2      51       SS      0       III
3      95       RR      7         V
4      54       SS      0        VI
5      58       RR      0        IV
6      92       RR     20         V
7      73       SS      7         I
8      56       RR      0        II
9      81       SS      7        VI
10     61       RR      1         I

# Relevel
dds$Genotype <- relevel(dds$Genotype, ref="SS")

# Test
dds <- DESeq(dds, test="LRT", reduced = ~ Time + Genotype)
res <- results(dds)
head(res[order(res$padj),])

log2 fold change (MLE): GenotypeRR.Time20
LRT p-value: '~ Genotype + Time + Genotype:Time' vs '~ Time + Genotype'
DataFrame with 6 rows and 6 columns
                       baseMean log2FoldChange     lfcSE      stat      pvalue      padj
                      <numeric>      <numeric> <numeric> <numeric>   <numeric> <numeric>
3p_Ssa_miR_462a     136.3663390   -1.326805416 0.4574671  14.02567 0.002870424 0.7316522
3p_Ssa_miR_731      304.0221843   -0.544891735 0.2804326  11.43828 0.009577344 0.7316522
3p_Ssa_miR_7552a_2    0.7595462   -1.318844886 1.7450825  11.31013 0.010161837 0.7316522
3p_Ssa_miR_9ab     1020.8777037    0.005572198 0.1696003  12.19612 0.006740674 0.7316522
3p_Ssa_miR_9a_3     494.4791700   -0.210968702 0.1636003  14.90190 0.001902433 0.7316522
3p_Ssa_miR_nov_5     22.4156278   -1.561194915 0.5553654  12.19148 0.006755200 0.7316522

I have several questions:

  1. As one can see, the adjusted values are too high. I would like to have a possible explanation for that.
  2. I would like to make sure that I use appropriate design and test formulas for my goals.

Thank you!

ADD COMMENTlink modified 5 months ago by Michael Love15k • written 5 months ago by ofonov0
2
gravatar for Michael Love
5 months ago by
Michael Love15k
United States
Michael Love15k wrote:

That's the appropriate design and test.

Those adjusted p-values represent the fact that you've tested many genes, and it's possible to see very small p-values under the null hypothesis when you perform thousands of tests. This is a consequence of the multiple test correction methodology.

ADD COMMENTlink written 5 months ago by Michael Love15k

Thank you for the rapid reply.

What would be the best option to select top DEGs in case when adjusted p-values are high, or does it mean that this analysis is meaningless? E.g I am interested in following test, and get output with padj=0.9975788:

> res_Genotype_RR_vs_SS<-results(dds, name="Genotype_RR_vs_SS", test="Wald")
> res_Genotype_RR_vs_SS
log2 fold change (MLE): Genotype RR vs SS
Wald test p-value: Genotype RR vs SS
DataFrame with 651 rows and 6 columns
                    baseMean log2FoldChange     lfcSE        stat    pvalue      padj
                   <numeric>      <numeric> <numeric>   <numeric> <numeric> <numeric>
3p_Ssa_let_7a_2    147.10241    0.094900837 0.1566111  0.60596483 0.5445381 0.9975788
3p_Ssa_let_7cd     380.69882    0.008774026 0.1939360  0.04524186 0.9639145 0.9975788
3p_Ssa_let_7c_2     43.82362   -0.042720552 0.2137731 -0.19984060 0.8416052 0.9975788
3p_Ssa_let_7e      138.70703    0.124070509 0.1655608  0.74939528 0.4536190 0.9975788
3p_Ssa_let_7f       39.72065    0.039311862 0.1877974  0.20933126 0.8341897 0.9975788
...                      ...            ...       ...         ...       ...       ...
ssa_mir_3413_3p    78.896131     -0.3343905 0.2664436  -1.2550142 0.2094736 0.9975788
ssa_mir_3413_5p    85.564884     -0.1489334 0.2527720  -0.5892005 0.5557268 0.9975788
ssa_mir_449_b_3p   11.522972      0.2132978 0.3608338   0.5911248 0.5544368 0.9975788
ssa_mir_449_b_5p    5.467536      0.2342977 0.4776624   0.4905090 0.6237738 0.9975788
ssa_mir_451_a_2_3p  6.589232     -0.4717298 0.4739626  -0.9952892 0.3195956 0.9975788
ADD REPLYlink written 5 months ago by ofonov0

I don't think you can say anything really, because the padj column is telling you the fold changes are consistent with no difference across genotype.

ADD REPLYlink written 4 months ago by Michael Love15k

Hi Michael,

I have a follow up question. I would like to take one more variable to the model - Family: B or C, so that my colData looks like

colData[sample(nrow(colData), 10), ]
# A tibble: 10 x 5
       ID Family Genotype   Time Replicate
   <fctr> <fctr>   <fctr> <fctr>    <fctr>
 1     53      B       SS      0         5
 2     89      B       SS     20         5
 3     57      B       RR      0         3
 4     42      C       SS     20         3
 5     29      C       RR      7         2
 6     44      C       RR     20         5
 7     64      B       RR      1         3
 8     86      B       SS     20         3
 9      6      C       SS      0         3
10     10      C       RR      0         5

Hypothesis is the same - there are DE genes across time points between different genotypes.

I am not sure what would be the best formula to model the data in this case. I have thought about two options:

  1. design = ~ Genotype + Time + Family + Genotype:Time, for LTR test: reduced = ~ Time + Genotype + Family

  2. design = ~ Genotype*Time*Family, for LTR test reduced = ~ Time + Genotype + Family

Which one would suit the best the purpose? Thank you.

 

 

ADD REPLYlink modified 4 months ago • written 4 months ago by ofonov0
1

I would go with (1), which controls for family while performing the standard time series test.

ADD REPLYlink written 4 months ago by Michael Love15k

Hi, I have another question regarding time series analysis. I would like to find out difference across time points, without taking into accout Genotype and Family. In other words, hypothesis is that there is a difference between time points. Searching through the documentation and other questions, I came up with following:


dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData = colData,
                             design = ~ Time + Genotype + Family)

dds <- DESeq(dds, test="LRT", reduced = ~ 1)

resultsNames(dds)

## [1] "Intercept"         "Time_1_vs_0"       "Time_7_vs_0"     
## [4] "Time_20_vs_0"      "Genotype_SS_vs_RR" "Family_C_vs_B"

# Test Time_1_vs_0
res_Time_1_vs_0_all<-results(dds, name="Time_1_vs_0", test="Wald")
head(res_Time_1_vs_0_all[order(res_Time_1_vs_0_all$padj),])

## log2 fold change (MLE): Time 1 vs 0
## Wald test p-value: Time 1 vs 0
## DataFrame with 6 rows and 6 columns
##                    baseMean log2FoldChange      lfcSE       stat
##                   <numeric>      <numeric>  <numeric>  <numeric>
## skotl_ssa_5119_3p  71.15728     -6.6272093 0.49714870 -13.330437
## 5p_Ssa_miR_19d     39.78204     -1.3800161 0.16127299  -8.557020
## 5p_Ssa_miR_222d    13.62511     -3.0916371 0.37427083  -8.260428
## 5p_Ssa_miR_25_3   100.87030     -0.7653306 0.09541316  -8.021227
## 3p_Ssa_miR_30a_2   81.12808     -0.9421497 0.12947530  -7.276675
## 3p_Ssa_miR_2184   418.65571      1.9976584 0.29516634   6.767907
##                         pvalue         padj
##                      <numeric>    <numeric>
## skotl_ssa_5119_3p 1.539932e-40 9.624573e-38
## 5p_Ssa_miR_19d    1.158227e-17 3.619458e-15
## 5p_Ssa_miR_222d   1.451516e-16 3.023992e-14
## 5p_Ssa_miR_25_3   1.046942e-15 1.635846e-13
## 3p_Ssa_miR_30a_2  3.421471e-13 4.276839e-11
## 3p_Ssa_miR_2184   1.306586e-11 1.361027e-09

I

1. Is it appropriate way of doing this analysis?

2. Does test "Time_1_vs_0" extract all the values in this case, both genotypes and famileis, so that log fold change represent

adjusted log2(all samples time 1/all samples time 0)?

I have read the ?results, but it is not entirely clear for me in this case.

3. An alternative way of analysis which I though about was not to use LTR, but do following:

dds <- DESeqDataSetFromMatrix(countData = counts,

                              colData = colData,

                             design = ~ Time + Genotype + Family)

dds <- DESeq(dds)

results(dds, name="Time_1_vs_0", test="Wald")

Is it correct way of doing this analysis as well?

 

Thank you!

 

 


 

ADD REPLYlink modified 3 months ago • written 4 months ago by ofonov0
The first LRT as you have it is testing if any of the variables in the full design explain counts. So that's not what you want. You should have family and genotype in reduced, if you are interested in focusing on any time differences. And yes, the Wald test as you have it will allow comparisons between two time points for all levels of other variables.
ADD REPLYlink written 4 months ago by Michael Love15k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 121 users visited in the last hour