DESeq2 false positive result?
1
1
Entering edit mode
Gregor ▴ 10
@834b5ffd
Last seen 14 months ago
Austria

Let's look at the expression data of three genes in two groups. This is a subset of a real dataset comparing unpaired treated and untreated mouse colon organoids.
By looking at it, it seems like there's a clear difference in Cxcl2, there is likely no difference in Epcam, and there's clearly no difference in U2 (ok, there are two outliers).

df = data.frame(
group=rep(c("a", "b", "c"), 6),
U2=c(0, 0, 0, 0, 0, 452.939, 0, 0, 0, 215.705, 0, 427.984, 0, 0, 1411.686, 674.763, 0, 0),
Cxcl2=c(6, 25, 6, 17, 20, 4, 2, 156, 6, 40, 1010.999, 35, 43, 98, 57, 16, 263, 54),
Epcam=c(24818, 30270, 24818, 26090, 41797, 42059, 33214, 36154, 40872, 137463, 45677, 80316, 61955, 61683, 82049, 82288, 80172, 67598)
) |> dplyr::filter(group != "c")


If I use a generalized linear model with negative binomial famliy as a baseline, the statistics exactly reflect my gut feeling:

summary(glm(Cxcl2 ~ group, family = MASS::negative.binomial(2), data=df))$coefficients["groupb", "Pr(>|t|)"] summary(glm(Epcam ~ group, family = MASS::negative.binomial(2), data=df))$coefficients["groupb", "Pr(>|t|)"]
summary(glm(U2 ~ group, family = MASS::negative.binomial(2), data=df))\$coefficients["groupb", "Pr(>|t|)"]

[1] 0.004115246
[1] 0.5350445
[1] 0.9946097


However, when running DESeq2, suddenly U2 is by far the most significant gene (padj = 3.3e-17 !!):

dds = DESeqDataSetFromMatrix(
df |> select(-group) |> t() |> round(),
colData = df |> select(group),
design = ~group
)
dds = DESeq(dds)
results(dds)

log2 fold change (MLE): group b vs a
Wald test p-value: group b vs a
DataFrame with 3 rows and 6 columns
baseMean log2FoldChange     lfcSE      stat      pvalue        padj
<numeric>      <numeric> <numeric> <numeric>   <numeric>   <numeric>
U2       77.4418      -25.71397  3.004014  -8.55987 1.12996e-17 3.38987e-17
Cxcl2    53.3127        1.73778  0.489112   3.55294 3.80956e-04 5.71434e-04
Epcam 62107.1289       -1.46626  0.453298  -3.23465 1.21790e-03 1.21790e-03


I don't understand that. Anything I'm doing wrong? If not, why does DESeq behave the way it does?

DESeq2 • 1.5k views
2
Entering edit mode
swbarnes2 ★ 1.3k
@swbarnes2-14086
Last seen 6 hours ago
San Diego

The average of U2 is different in a than it is in b. You can't expect DESeq to automatically remove outliers, if you judge that those two samples should be omitted, you should do that yourself. There are ways of filtering out genes where, say < 5 samples have more than 10 counts, that would have removed that gene from your dataset.

1
Entering edit mode

The problem is attempting to compare LFC/SE across genes when one is at the boundary of the parameter space. This breaks down as p-values donâ€™t measure interestingness. I'd recommend to try LRT instead here (deals better with cases where you are at boundary of parameter space) and use shrunken LFC for effect size ranking.