This is a crosspost from my old Biostars post - unfortunately no luck there yet. I do consider this more of a theoretical question, though, which is why I initially posted it on Biostars.
Hi! I have been battling with a multifaceted problem for months now and I can't figure out how to solve or clarify it; I would be extremely grateful for any comments, suggestions, opinions, or criticism. In short, I can't decide whether I should use all of my data or half of my data in DESeq2 when calculating size factors, estimating dispersions, etc. That is, when are two sets of samples 'too different', such that including both would unfavourably skew these estimates during model fitting and produce a mix of true and false positives?
The closest thing I have found in the vignette is http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#if-i-have-multiple-groups-should-i-run-all-together-or-split-into-pairs-of-groups, and some discussions on Biostars too, but none quite explained things to the level that would put me at ease.
Experimental setup: 4 sorted cell populations (termed cell stages) in two replicates for 2 clones of the same PBMC type obtained from different individuals = 16 samples sequenced together.
Goals:
- Identify differentially expressed genes between pairwise comparisons of different cell stages in clone 1.
- Identify differentially expressed genes between pairwise comparisons of different cell stages in clone 2.
- Identify differentially expressed genes between pairwise comparisons of different cell stages in both clones, either by simply taking the overlap of the previous two results, or by specifying a design matrix that accounts for the "clone" difference - I would hope both would lead to similar conclusions...
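For the third goal, a hedged sketch of what "accounting for the clone difference in the design" could look like (assuming the count matrix `DF` and the `sample_info` table shown below): an additive design controls for the clone effect, so stage contrasts estimate the effect shared across both clones.

```r
## Sketch only, not a recommendation: additive design ~ clones + condition
## controls for the baseline clone difference while testing stage effects
## that are shared across both clones. Assumes DF and sample_info as below.
library(DESeq2)
dds_both <- DESeqDataSetFromMatrix(countData = DF,
                                   colData   = sample_info,
                                   design    = ~ clones + condition)
dds_both <- DESeq(dds_both)
res_stage2_vs_stage1 <- results(dds_both,
                                contrast = c("condition", "stage2", "stage1"))
```

Note this parameterization assumes the stage effect is the same in both clones; clone-specific stage effects would need interaction terms or the combined `clone_condition` factor used below.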
Problems: As these cells come from different individuals, there is biological variation between them, which is of interest. However, mutually 'changing' genes are of interest as well. There are multiple genes that are expressed only in clone1 or only in clone2, which DESeq2 won't include in normalization, but which could be important for a clone-specific sample normalization and other estimates (I would think?). To try my best to visually aid what I am talking about:
#This just includes every bit of information I have
sample_info
#                      condition clones replicate clone_condition
# clone1_stage_1_rep_1 stage1    clone1 1         clone1_stage1
# clone1_stage_2_rep_1 stage2    clone1 1         clone1_stage2
# clone1_stage_3_rep_1 stage3    clone1 1         clone1_stage3
# clone1_stage_4_rep_1 stage4    clone1 1         clone1_stage4
# clone1_stage_1_rep_2 stage1    clone1 2         clone1_stage1
# clone1_stage_2_rep_2 stage2    clone1 2         clone1_stage2
# clone1_stage_3_rep_2 stage3    clone1 2         clone1_stage3
# clone1_stage_4_rep_2 stage4    clone1 2         clone1_stage4
# clone2_stage_1_rep_1 stage1    clone2 1         clone2_stage1
# clone2_stage_2_rep_1 stage2    clone2 1         clone2_stage2
# clone2_stage_3_rep_1 stage3    clone2 1         clone2_stage3
# clone2_stage_4_rep_1 stage4    clone2 1         clone2_stage4
# clone2_stage_1_rep_2 stage1    clone2 2         clone2_stage1
# clone2_stage_2_rep_2 stage2    clone2 2         clone2_stage2
# clone2_stage_3_rep_2 stage3    clone2 2         clone2_stage3
# clone2_stage_4_rep_2 stage4    clone2 2         clone2_stage4
data_set <- DESeqDataSetFromMatrix(countData = DF,
                                   colData   = sample_info,
                                   design    = ~ clone_condition)
Side note: would a model without an intercept be more appropriate for the pairwise comparisons? design(dds) <- formula(~ 0 + clone_condition)
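On that side note: for pairwise contrasts between levels of a single factor, my understanding is that the two parameterizations give the same contrast results; only the interpretation of the individual coefficients changes. A hedged sketch (assuming `data_set` built as above):

```r
## Sketch: cell-means coding (no intercept) - each coefficient is a group
## mean rather than a difference from a reference level. A contrast between
## two levels of the single factor should be unaffected by this choice.
dds <- data_set
design(dds) <- formula(~ 0 + clone_condition)
dds <- DESeq(dds)
res_no_intercept <- results(dds,
    contrast = c("clone_condition", "clone1_stage2", "clone1_stage1"))
```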
PCA plot:
It is quite clear that PC1 explains the differences between the clones themselves and PC2 explains the differences between stages, but the first 2 goals look at intra-clonal effects anyway, so I won't be directly comparing the clones with each other at this stage (e.g. I might ask what is DE between stage2 and stage3 in clone1, regardless of clone2). That said, is it then appropriate to even allow DESeq2 to gather information across both clones if I am looking at intra-clonal pairwise comparisons? Or would it be more appropriate to model the clones separately for goals 1-2 and together for goal 3?
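For clarity, the "separate" alternative I have in mind is simply subsetting before any estimation, so size factors and dispersions are computed from one clone's samples only (a sketch, assuming `data_set` from above):

```r
## Sketch: split the DESeqDataSet per clone before running the pipeline,
## so normalization and dispersion estimates use only that clone's samples.
clone1_dds <- data_set[, data_set$clones == "clone1"]
clone1_dds$clone_condition <- droplevels(clone1_dds$clone_condition)
clone1_dds <- DESeq(clone1_dds)

clone2_dds <- data_set[, data_set$clones == "clone2"]
clone2_dds$clone_condition <- droplevels(clone2_dds$clone_condition)
clone2_dds <- DESeq(clone2_dds)
```

The droplevels() calls avoid DESeq2 complaining about factor levels with no remaining samples.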
It gets even more confusing for me when I try either approach: e.g. for the stage2 vs stage1 contrast, clone1 gains low-p-value gene expression differences (more genes with low p) when run together, compared to running it alone through the pipeline, while clone2 loses low p-values when run together as opposed to being run alone from the start (shown below).
Sure, some of this is the effect of independent filtering and multiple-testing correction, and I won't know whether these are TPs or FPs. But manually checking some genes, it seemed, e.g., that genes with zero counts in clone2 that showed expression (as well as the 'expected dynamics' in expression differences) in clone1 were non-significant when run together, but significant in clone1 when run separately (not shown here). I am not talking about huge LFCs (and I am not trying to massage p-values), but those genes were biologically interesting (for that clone) and potentially something to look into further down the line by separate means, which might otherwise be missed with a general p-value filter. I don't think it is feasible to go through all of these manually... but it does make me question what else I could be missing with a non-ideal analysis setup.
> clone1_stage2_vs_stage1 <- results(dds, contrast=c("clone_condition", "clone1_stage2", "clone1_stage1"))
> clone2_stage2_vs_stage1 <- results(dds, contrast=c("clone_condition", "clone2_stage2", "clone2_stage1"))
> clone1_stage2_vs_stage1_sep <- results(clone1_dds, contrast=c("clone_condition", "clone1_stage2", "clone1_stage1"))
> clone2_stage2_vs_stage1_sep <- results(clone2_dds, contrast=c("clone_condition", "clone2_stage2", "clone2_stage1"))
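To separate how much of the gain/loss above comes from independent filtering rather than from the shared size-factor/dispersion estimates themselves, one could rerun a contrast with filtering disabled (a sketch, assuming the combined `dds` from above):

```r
## Sketch: with independentFiltering = FALSE, every gene with nonzero
## counts keeps an adjusted p-value, which makes the combined and
## split runs easier to compare gene-by-gene.
res_nofilt <- results(dds,
    contrast = c("clone_condition", "clone1_stage2", "clone1_stage1"),
    independentFiltering = FALSE)
summary(res_nofilt)
```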
### These first two are for when both clones are used in data normalization by DESeq2
> summary(clone1_stage2_vs_stage1)
out of 24656 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 4318, 18%
LFC < 0 (down) : 3786, 15%
outliers [1] : 0, 0%
low counts [2] : 4781, 19%
(mean count < 3)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
> summary(clone2_stage2_vs_stage1)
out of 24656 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 5243, 21%
LFC < 0 (down) : 4562, 19%
outliers [1] : 0, 0%
low counts [2] : 4781, 19%
(mean count < 3)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
## These second two are when clones are split from the start and DESeq2 does not 'share' information across clones
> summary(clone1_stage2_vs_stage1_sep)
out of 23616 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 3735, 16%
LFC < 0 (down) : 3086, 13%
outliers [1] : 0, 0%
low counts [2] : 4568, 19%
(mean count < 2)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
> summary(clone2_stage2_vs_stage1_sep)
out of 24228 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 5811, 24%
LFC < 0 (down) : 4793, 20%
outliers [1] : 0, 0%
low counts [2] : 3755, 15%
(mean count < 2)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
I don't know enough to think of any useful plots or metrics to help me decide. I did plot the dispersion estimates, but I am not sure whether I am overestimating the dispersion when running the clones together, resulting in fewer TPs.
As it has been a while now, I can say that I carried on modeling them separately (based entirely on my underdeveloped intuition); I also say this in the comments on Biostars. But I still have restless nights with this haunting me, and I get the sense from my Biostars post that I might not be the only one. Please help.
Thank you so much for your reply and clarifying!
This is excellent news, now I have some confidence that I did not mess up right in the beginning of my analysis!
Can I also ask a few follow up questions, because I think I still lack the proper insight, but I do want to grasp these concepts (at least to some comfortable level):
a) Ideally, would I want a 'sharper' drop in the dispersion curve and also a low dispersion value (where the curve starts leveling off)? Or does the drop not matter that much, and the dispersion value is more important?
b) And would that also mean that a gene's expression does not differ a lot within the same group of samples? (And in general we don't expect most genes to change much in expression within a sample group, and this is easier to assess when the mean is high, i.e. the right side of the plot?)
c) The black dots are shrunk towards the fitted red line for the final dispersion estimates - these become the blue dots. How does the 'area' of blue dots relate to good or bad (or preferred vs non-preferred)? Does a large area/spread of blue dots mean that the final estimates are still quite spread out, and would it be better to have final estimates thin and close to the red line? (It probably is not this simple and has multiple caveats of over/under-estimating a gene's dispersion by shrinking it.)
d) am I just over-thinking this?
a) No I wouldn't say you want the curve to look any particular way. Not sure I have any insight here toward this question.
b) Lower dispersion means less within-group variability.
c) Again, it's not really good or bad, just the amount of within-group variability. The asymptote -- the flat part of the line -- gives you a sense of the within-group variability. In particular, if you take the square root of the dispersion, this is the typical within-group coefficient of variation (SD/mean).
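As a concrete illustration of that last point (a sketch; `dds` is assumed to be a DESeqDataSet already run through DESeq()):

```r
## Sketch: convert final per-gene dispersion estimates to within-group CVs.
## e.g. a dispersion of 0.04 gives sqrt(0.04) = 0.2, i.e. a typical
## within-group coefficient of variation of about 20%.
cv <- sqrt(dispersions(dds))
summary(cv)
```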
Great! Thanks again for taking the time to reply - it has been very helpful.
I have the same problem. I have 3 sample groups: one WT and two treatments, treatment1 and treatment2. It is confusing for me that when I try either approach, e.g. a multiple-group comparison for the WT vs treatment1 contrast, I get a different DESeq2 result,
dds2 <- DESeq(dds, betaPrior = FALSE)
resultsNames(dds2)
# [1] "Intercept" "condition_treatment1_vs_WT" "condition_treatment2_vs_WT"
res <- results(dds2, name="condition_treatment1_vs_WT", alpha = 0.05)
summary(res)
out of 26151 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up) : 2463, 9.4%
LFC < 0 (down) : 2982, 11%
outliers [1] : 2, 0.0076%
low counts [2] : 5070, 19%
(mean count < 5)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
compared with running it alone, i.e. only the WT and treatment1 data, through the same pipeline:
des <- DESeq(dds)
res <- results(des, alpha = 0.05)
summary(res)
out of 24475 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up) : 2657, 11%
LFC < 0 (down) : 3140, 13%
outliers [1] : 0, 0%
low counts [2] : 3322, 14%
(mean count < 5)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
Can anyone explain why, and which approach is right, i.e. the DESeq2 1-vs-1 comparison or the multiple-group comparison?
"It is confusing for me when I try either approach and e.g. multiple group comparison for WT vs treatment1 contrast i am getting deseq2 result diiferent"
Have you read the vignette section referenced in this thread?
See the original post above which links to the DESeq2 FAQ. It explains why results would differ based on the amount of data provided.
"Can anyone explain why and which approach is right"
Again, quoting myself from above: