Question

DESeq2 false positive result?

1

Entering edit mode

Gregor ▴ 10

@834b5ffd

Last seen 20 months ago

Austria

Let's look at the expression data of three genes in two groups. This is a subset of a real dataset comparing unpaired treated and untreated mouse colon organoids.
By looking at it, it seems like there's a clear difference in Cxcl2, there is likely no difference in Epcam, and there's clearly no difference in U2 (ok, there are two outliers).

df = data.frame(
  group=rep(c("a", "b", "c"), 6),
  U2=c(0, 0, 0, 0, 0, 452.939, 0, 0, 0, 215.705, 0, 427.984, 0, 0, 1411.686, 674.763, 0, 0),
  Cxcl2=c(6, 25, 6, 17, 20, 4, 2, 156, 6, 40, 1010.999, 35, 43, 98, 57, 16, 263, 54),
  Epcam=c(24818, 30270, 24818, 26090, 41797, 42059, 33214, 36154, 40872, 137463, 45677, 80316, 61955, 61683, 82049, 82288, 80172, 67598)
) |> dplyr::filter(group != "c")

gene expression

If I use a generalized linear model with negative binomial famliy as a baseline, the statistics exactly reflect my gut feeling:

summary(glm(Cxcl2 ~ group, family = MASS::negative.binomial(2), data=df))$coefficients["groupb", "Pr(>|t|)"]
summary(glm(Epcam ~ group, family = MASS::negative.binomial(2), data=df))$coefficients["groupb", "Pr(>|t|)"]
summary(glm(U2 ~ group, family = MASS::negative.binomial(2), data=df))$coefficients["groupb", "Pr(>|t|)"]

[1] 0.004115246
[1] 0.5350445
[1] 0.9946097

However, when running DESeq2, suddenly U2 is by far the most significant gene (padj = 3.3e-17 !!):

dds = DESeqDataSetFromMatrix(
  df |> select(-group) |> t() |> round(),
  colData = df |> select(group),
  design = ~group
)
dds = DESeq(dds)
results(dds)

log2 fold change (MLE): group b vs a 
Wald test p-value: group b vs a 
DataFrame with 3 rows and 6 columns
        baseMean log2FoldChange     lfcSE      stat      pvalue        padj
       <numeric>      <numeric> <numeric> <numeric>   <numeric>   <numeric>
U2       77.4418      -25.71397  3.004014  -8.55987 1.12996e-17 3.38987e-17
Cxcl2    53.3127        1.73778  0.489112   3.55294 3.80956e-04 5.71434e-04
Epcam 62107.1289       -1.46626  0.453298  -3.23465 1.21790e-03 1.21790e-03

I don't understand that. Anything I'm doing wrong? If not, why does DESeq behave the way it does?

DESeq2 • 1.7k views

ADD COMMENT • link updated 20 months ago by swbarnes2 ★ 1.4k • written 20 months ago by Gregor ▴ 10

score 2 · Accepted Answer · 2022-11-30

2

Entering edit mode

swbarnes2 ★ 1.4k

@swbarnes2-14086

Last seen 57 minutes ago

San Diego

The average of U2 is different in a than it is in b. You can't expect DESeq to automatically remove outliers, if you judge that those two samples should be omitted, you should do that yourself. There are ways of filtering out genes where, say < 5 samples have more than 10 counts, that would have removed that gene from your dataset.

ADD COMMENT • link 20 months ago swbarnes2 ★ 1.4k

1

Entering edit mode

Yes, also copying over my comments from where this was asked on twitter:

The problem is attempting to compare LFC/SE across genes when one is at the boundary of the parameter space. This breaks down as p-values don’t measure interestingness. I'd recommend to try LRT instead here (deals better with cases where you are at boundary of parameter space) and use shrunken LFC for effect size ranking.

ADD REPLY • link 20 months ago Michael Love 42k