Scenario: I have a dataset with 3 replicates for each of 12 treatments (say A...L), and I am interested in identifying DEGs occurring between particular treatment pairs (e.g. A vs B).
Observation: If I create the dds from the whole dataset and choose a specific contrast, I get different baseMean, lfcSE, pvalue and padj values than I do if I create the dds using only data from the pair of interest.
CONTRAST SPECIFIED FROM WITHIN DDS OF ALL TREATMENTS:
results(dds, contrast=c("condition","A","B"))
log2 fold change (MLE): condition A vs B
Wald test p-value: condition A vs B
DataFrame with 19189 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ATCG00020 5873.22144 2.926121 1.021244 2.86525 0.004166788 0.402878
ATCG00040 55.49366 2.556820 0.772141 3.31134 0.000928506 0.321590
ATCG00050 4.24306 0.911677 1.097440 0.83073 0.406125929 0.895508
ATCG00070 129.27108 1.041406 0.570771 1.82456 0.068067257 0.722884
ATCG00080 282.81681 1.066214 0.789473 1.35054 0.176842976 0.811457
...
RESULT FROM DATA INPUT ONLY FOR TREATMENT A and B:
results(AB_dds)
log2 fold change (MLE): condition A vs B
Wald test p-value: condition A vs B
DataFrame with 17890 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ATCG00020 4869.7243 2.92651 0.331037 8.84042 9.53567e-19 1.44065e-14
ATCG00040 65.4715 2.55531 0.572043 4.46698 7.93303e-06 2.66338e-03
ATCG00070 134.0186 1.03323 0.320381 3.22499 1.25976e-03 1.00171e-01
ATCG00080 313.3158 1.06132 0.373712 2.83994 4.51217e-03 2.24243e-01
ATCG00120 224.4218 2.09119 0.363608 5.75122 8.86001e-09 6.37414e-06
...
I'm slightly surprised by this difference.
In general, when looking at a particular contrast, what effect are the other samples outside of the contrast having on the results, and why?
Replying to my own question! I think this part of the FAQ explains why, in our case, adding more groups (including all conditions) makes the model worse by inflating the per-gene dispersion estimate. The PCA plot indeed shows groups with high within-group variability. http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#if-i-have-multiple-groups-should-i-run-all-together-or-split-into-pairs-of-groups
Hi. Precisely your PCA plot is the guiding plot whether to split by group or take all together. I too had a similar discussion here:Strategy to perform DESeq, DEXSeq workflow
Hope this will help. If you have a lot of intrasample variability (biological replicates differ drastically), you should split by groups.