Hi,
I would like to do a time-series analysis in DESeq2. I have following experimental design:
Genotype: RR, SS
Time: 0, 1, 7, 20 days
There are 6 biological replicates for each time point.
I would like to find a list of DEGs between genotypes taking into account time. For that I followed an example at http://www.bioconductor.org/help/workflows/rnaseqGene/#time-course-experiments
I used following to construct a DESeqDataSet object:
dds <- DESeqDataSetFromMatrix(countData = counts, colData = colData, design = ~ Genotype + Time + Genotype:Time)
> colData[sample(nrow(colData), 10), ] # A tibble: 10 × 4 ID Genotype Time Replicate <fctr> <fctr> <fctr> <fctr> 1 74 SS 7 II 2 51 SS 0 III 3 95 RR 7 V 4 54 SS 0 VI 5 58 RR 0 IV 6 92 RR 20 V 7 73 SS 7 I 8 56 RR 0 II 9 81 SS 7 VI 10 61 RR 1 I # Relevel dds$Genotype <- relevel(dds$Genotype, ref="SS") # Test dds <- DESeq(dds, test="LRT", reduced = ~ Time + Genotype) res <- results(dds) head(res[order(res$padj),]) log2 fold change (MLE): GenotypeRR.Time20 LRT p-value: '~ Genotype + Time + Genotype:Time' vs '~ Time + Genotype' DataFrame with 6 rows and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> 3p_Ssa_miR_462a 136.3663390 -1.326805416 0.4574671 14.02567 0.002870424 0.7316522 3p_Ssa_miR_731 304.0221843 -0.544891735 0.2804326 11.43828 0.009577344 0.7316522 3p_Ssa_miR_7552a_2 0.7595462 -1.318844886 1.7450825 11.31013 0.010161837 0.7316522 3p_Ssa_miR_9ab 1020.8777037 0.005572198 0.1696003 12.19612 0.006740674 0.7316522 3p_Ssa_miR_9a_3 494.4791700 -0.210968702 0.1636003 14.90190 0.001902433 0.7316522 3p_Ssa_miR_nov_5 22.4156278 -1.561194915 0.5553654 12.19148 0.006755200 0.7316522
I have several questions:
- As one can see, the adjusted values are too high. I would like to have a possible explanation for that.
- I would like to make sure that I use appropriate design and test formulas for my goals.
Thank you!
Thank you for the rapid reply.
What would be the best option to select top DEGs in case when adjusted p-values are high, or does it mean that this analysis is meaningless? E.g I am interested in following test, and get output with padj=0.9975788:
I don't think you can say anything really, because the padj column is telling you the fold changes are consistent with no difference across genotype.
Sorry I have a treatment and control over time points with two replication. I want to compare for example 2h in treatment vs 16h in control but whatever I try all comparisons are about 0 vs another time points
> coldata = data.frame(row.names=colnames(mergedd), Platform = as.factor(c(rep("T",18),rep("R",18))),Time=as.factor(c("0","0","2","2","4","4","6","6","8","8","10","10","12","12","14","14","16","16","0","0","2","2","4","4","6","6", "8","8","10","10","12","12","14","14","16","16")))
> Platform= coldata$Platform
> Time=coldata$Time
> time
[1] 0 0 16 16 14 14 12 12 10 10 8 8 6 6 4 4 2 2 0 2 4 6 8 10 12 14 16 0 2 4 6 8 10 12 14 16
Levels: 0 10 12 14 16 2 4 6 8
> class(time)
[1] "factor"
> Platform
[1] T T T T T T T T T T T T T T T T T T R R R R R R R R R R R R R R R R R R
Levels: R T
> class(Platform)
[1] "factor"
> dds <- DESeqDataSetFromMatrix(countData=mergedd, colData= coldata ,design = ~ Time + Platform)
> design(dds) = ~ Time + Platform + Time:Platform
> dds = DESeq(dds, test="Wald")
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
> resultsNames(dds)
[1] "Intercept" "Time_10_vs_0" "Time_12_vs_0" "Time_14_vs_0" "Time_16_vs_0" "Time_2_vs_0"
[7] "Time_4_vs_0" "Time_6_vs_0" "Time_8_vs_0" "Platform_T_vs_R" "Time10.PlatformT" "Time12.PlatformT"
[13] "Time14.PlatformT" "Time16.PlatformT" "Time2.PlatformT" "Time4.PlatformT" "Time6.PlatformT" "Time8.PlatformT"
>
May you please consider why I can't obtained another comparisons?
Thank you for any help
Can you start a new post? I think it will be easier to answer this way.
Hi Michael,
I have a follow up question. I would like to take one more variable to the model - Family: B or C, so that my colData looks like
Hypothesis is the same - there are DE genes across time points between different genotypes.
I am not sure what would be the best formula to model the data in this case. I have thought about two options:
design = ~ Genotype + Time + Family + Genotype:Time
, for LTR test:reduced = ~ Time + Genotype + Family
design = ~ Genotype*Time*Family
, for LTR testreduced = ~ Time + Genotype + Family
Which one would suit the best the purpose? Thank you.
I would go with (1), which controls for family while performing the standard time series test.
Hi, I have another question regarding time series analysis. I would like to find out difference across time points, without taking into accout Genotype and Family. In other words, hypothesis is that there is a difference between time points. Searching through the documentation and other questions, I came up with following:
dds <- DESeqDataSetFromMatrix(countData = counts,
colData = colData,
design = ~ Time + Genotype + Family)
dds <- DESeq(dds, test="LRT", reduced = ~ 1)
resultsNames(dds)
## [1] "Intercept" "Time_1_vs_0" "Time_7_vs_0"
## [4] "Time_20_vs_0" "Genotype_SS_vs_RR" "Family_C_vs_B"
# Test Time_1_vs_0
res_Time_1_vs_0_all<-results(dds, name="Time_1_vs_0", test="Wald")
head(res_Time_1_vs_0_all[order(res_Time_1_vs_0_all$padj),])
## log2 fold change (MLE): Time 1 vs 0
## Wald test p-value: Time 1 vs 0
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat
## <numeric> <numeric> <numeric> <numeric>
## skotl_ssa_5119_3p 71.15728 -6.6272093 0.49714870 -13.330437
## 5p_Ssa_miR_19d 39.78204 -1.3800161 0.16127299 -8.557020
## 5p_Ssa_miR_222d 13.62511 -3.0916371 0.37427083 -8.260428
## 5p_Ssa_miR_25_3 100.87030 -0.7653306 0.09541316 -8.021227
## 3p_Ssa_miR_30a_2 81.12808 -0.9421497 0.12947530 -7.276675
## 3p_Ssa_miR_2184 418.65571 1.9976584 0.29516634 6.767907
## pvalue padj
## <numeric> <numeric>
## skotl_ssa_5119_3p 1.539932e-40 9.624573e-38
## 5p_Ssa_miR_19d 1.158227e-17 3.619458e-15
## 5p_Ssa_miR_222d 1.451516e-16 3.023992e-14
## 5p_Ssa_miR_25_3 1.046942e-15 1.635846e-13
## 3p_Ssa_miR_30a_2 3.421471e-13 4.276839e-11
## 3p_Ssa_miR_2184 1.306586e-11 1.361027e-09
I
1. Is it appropriate way of doing this analysis?
2. Does test
"Time_1_vs_0"
extract all the values in this case, both genotypes and famileis, so that log fold change representadjusted log2(all samples time 1/all samples time 0)?
I have read the ?results, but it is not entirely clear for me in this case.
3. An alternative way of analysis which I though about was not to use LTR, but do following:
dds <- DESeqDataSetFromMatrix(countData = counts,
colData = colData,
design = ~ Time + Genotype + Family)
dds <- DESeq(dds)
results(dds, name="Time_1_vs_0", test="Wald")
Is it correct way of doing this analysis as well?
Thank you!