Hi DESeq2 developers,
I have run DEseq2 on a dataset composed by 90 samples (18 different groups x 5 replicates each) and I am interested in the geneA that is known by literature to be highly expressed in 1 group (S04) and has zero counts in almost all the others (attached at the end of the post)
dds <- DESeqDataSetFromMatrix(countData = cts,
colData = col_final,
design = ~stage)
dds2 <- DESeq(dds)
resultsNames(dds2)
out <- results(dds2, name="stage_S04_vs_S03")
This is the result I obtain for geneA:
gene baseMean log2FoldChange lfcSE stat pvalue padj
geneA 7.6376 9.5624 4.5052 2.1225 0.03379 0.1128
GeneA is highly over expressed however the lfcSE is high and consequently the padj is not significant (if considering a threshold of .05).
I know DESeq2 has been well designed in order to handle zero counts but this result sounds strange to me and so I tried to add 1 at all the raw count values of all the genes in my dataset and I performed again the analysis obtaining the following result:
gene baseMean log2FoldChange lfcSE stat pvalue padj
geneA 15.1641 7.2097 0.6964 10.3520 4.10E-25 1.49E-22
Here, the result is highly significant.
My question is, why this difference just adding 1 to all the gene raw counts? Should I perform the DE analysis in a different way in order to make it fit better on genes with zero counts in almost all samples?
Thanks in advance.
Fred
Raw counts for geneA:
gene S01_R1 S01_R2 S01_R3 S01_R4 S01_R5 S02_R1 S02_R2 S02_R3 S02_R4 S02_R5 S03_R1 S03_R2 S03_R3 S03_R4 S03_R5 S04_R1 S04_R2 S04_R3 S04_R4 S04_R5 S05_R1 S05_R2 S05_R3 S05_R4 S05_R5 S06_R1 S06_R2 S06_R3 S06_R4 S06_R5 S07_R1 S07_R2 S07_R3 S07_R4 S07_R5 S08_R1 S08_R2 S08_R3 S08_R4 S08_R5 S09_R1 S09_R2 S09_R3 S09_R4 S09_R5 S10_R1 S10_R2 S10_R3 S10_R4 S10_R5 S11_R1 S11_R2 S11_R3 S11_R4 S11_R5 S12_R1 S12_R2 S12_R3 S12_R4 S12_R5 S13_R1 S13_R2 S13_R3 S13_R4 S13_R5 S14_R1 S14_R2 S14_R3 S14_R4 S14_R5 S15_R1 S15_R2 S15_R3 S15_R4 S15_R5 S16_R1 S16_R2 S16_R3 S16_R4 S16_R5 S17_R1 S17_R2 S17_R3 S17_R4 S17_R5 S18_R1 S18_R2 S18_R3 S18_R4 S18_R5
geneA 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 157 217 77 182 123 67 136 76 60 77 0 10 10 8 0 0 5 5 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
normalized counts for geneA:
gene S01_R1 S01_R2 S01_R3 S01_R4 S01_R5 S02_R1 S02_R2 S02_R3 S02_R4 S02_R5 S03_R1 S03_R2 S03_R3 S03_R4 S03_R5 S04_R1 S04_R2 S04_R3 S04_R4 S04_R5 S05_R1 S05_R2 S05_R3 S05_R4 S05_R5 S06_R1 S06_R2 S06_R3 S06_R4 S06_R5 S07_R1 S07_R2 S07_R3 S07_R4 S07_R5 S08_R1 S08_R2 S08_R3 S08_R4 S08_R5 S09_R1 S09_R2 S09_R3 S09_R4 S09_R5 S10_R1 S10_R2 S10_R3 S10_R4 S10_R5 S11_R1 S11_R2 S11_R3 S11_R4 S11_R5 S12_R1 S12_R2 S12_R3 S12_R4 S12_R5 S13_R1 S13_R2 S13_R3 S13_R4 S13_R5 S14_R1 S14_R2 S14_R3 S14_R4 S14_R5 S15_R1 S15_R2 S15_R3 S15_R4 S15_R5 S16_R1 S16_R2 S16_R3 S16_R4 S16_R5 S17_R1 S17_R2 S17_R3 S17_R4 S17_R5 S18_R1 S18_R2 S18_R3 S18_R4 S18_R5
geneA 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 65.87101129 89.66876161 43.13716558 95.51107793 66.68373543 47.89012269 85.70241589 53.78061654 42.17331857 53.9678294 0 10.34768275 9.560664419 7.853550795 0 0 5.783434837 4.973188666 0 0 0 0 0 4.48195665 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
normalized counts for geneA. Normalization made after adding +1 to raw counts of all genes:
gene S01_R1 S01_R2 S01_R3 S01_R4 S01_R5 S02_R1 S02_R2 S02_R3 S02_R4 S02_R5 S03_R1 S03_R2 S03_R3 S03_R4 S03_R5 S04_R1 S04_R2 S04_R3 S04_R4 S04_R5 S05_R1 S05_R2 S05_R3 S05_R4 S05_R5 S06_R1 S06_R2 S06_R3 S06_R4 S06_R5 S07_R1 S07_R2 S07_R3 S07_R4 S07_R5 S08_R1 S08_R2 S08_R3 S08_R4 S08_R5 S09_R1 S09_R2 S09_R3 S09_R4 S09_R5 S10_R1 S10_R2 S10_R3 S10_R4 S10_R5 S11_R1 S11_R2 S11_R3 S11_R4 S11_R5 S12_R1 S12_R2 S12_R3 S12_R4 S12_R5 S13_R1 S13_R2 S13_R3 S13_R4 S13_R5 S14_R1 S14_R2 S14_R3 S14_R4 S14_R5 S15_R1 S15_R2 S15_R3 S15_R4 S15_R5 S16_R1 S16_R2 S16_R3 S16_R4 S16_R5 S17_R1 S17_R2 S17_R3 S17_R4 S17_R5 S18_R1 S18_R2 S18_R3 S18_R4 S18_R5
geneA 1.635774184 1.719785637 1.702839298 1.703605039 1.903465796 1.430847931 1.421261525 1.407608774 1.380372662 1.311115729 1.079183208 1.09682498 1.051560729 1.090648727 1.084936567 162.9431838 223.0955085 84.35498283 195.9930043 132.978557 70.12744619 138.2285046 79.40901995 63.31627122 80.60083613 1.118879863 12.43936071 12.04617822 9.99266181 1.203508913 1.267958091 7.195204691 6.683102755 1.136002475 1.169048978 1.146531926 1.127530336 1.159288781 5.82279321 1.194192964 0.987346071 1.031285973 1.039259226 0.964340819 1.007731369 1.042617247 1.020616152 0.990375684 1.00272009 1.007731369 1.015522513 0.996295515 0.858884969 0.992064448 0.861797818 1.022674459 1.046543383 0.799627887 0.856250253 1.048665614 1.035942571 1.081474139 1.023373892 1.023373892 1.002792029 1.007731369 0.891690055 1.023373892 1.015522513 1.039259226 1.023373892 1.031285973 1.007731369 1.007731369 1.007731369 0.895166785 0.925114122 0.831105998 0.839899029 0.929809369 0.797988453 0.87220481 0.716255333 0.818337073 0.828413752 0.944984884 0.932353706 0.880071705 0.973851877 1.007731369
Thanks Michael for your suggestions.
First of all, the DEseq2 version I'm using is the DESeq21.22.2 and thus not the last one (my fault!). I've updated my DESeq2 version but nothing changes. (Rversion 3.6.1, Platform: x8664-apple-darwin15.6.0 (64-bit), Running under: macOS High Sierra 10.13.6, DESeq2_1.24.0)
I then tried with LTR using the following commands:
For geneA I still obtain high lfcSE:
ThenI tried with Baesyan estimation using the following commands:
For geneA I obtain the following result:
Did I do everything fine? Any other idea?
I forgot to mention that the DESeq() is giving the following warning:
Thanks.
Fred
hi Fred,
I took a look and two things were going on that lead to bad estimates of dispersion. One was that there were many genes with all samples having 0 or very low counts, and these were creating a bad dispersion fit. The fit was improved by removing genes where there weren't 5 samples with a count of 10 or higher.
The other factor was that with so many samples, the standard minimum estimated count was giving overly high estimates of dispersion for these genes that have a low mean across all samples. This also happens in scRNA-seq for example, and the solution to improve DESeq2's dispersion estimates is to lower this value to 1e-6:
These two alterations lead to this gene getting the expected SE:
Great, it worked! Thanks Michael.
Fred
Sorry to bother you again.
I think something is still missing. Using the command below:
The gene called "gene25047" results significantly upregulated:
However it has 0 raw counts in all the S03 and a raw counts > 0 just in 1 replicate
Did I do something wrong?
Thanks. Fred
One thing is that, I don't think this would be significant with LRT or with apeglm code from above. I've seen that both of these approaches do better for these comparisons of many groups where some number of the groups have all 0's. Could you check that?
What I think is happening here is again bad SE estimation, but now it's because the other groups for this gene are informing the model of a low dispersion. The Wald requires well estimated SE and these are difficult cases for it because they lie on the parameter boundary.
Another approach -- given that you informed me of the group relationships over email -- instead of per group comparisons would be to model the expression across the groups using a spline, e.g.
design = ~ns(x)
where x is a numeric specifying the relationship of the groups to each other (e.g. a grid).Hi Michael, I confirm that the gene25047 is not significant using the other approaches you suggested me. I will also try to use design = ~ns(x).
Thanks Fred
Could i take a look at the dds object? Something seems strange. I wonder if the dispersion estimation is being thrown off by some subset of features.
Could you email to maintainer(“DESeq2”)?