DESeq2 detects DE when expression is 0 under both contrast levels
1
0
Entering edit mode
tw372 • 0
@tw372-7401
Last seen 9.2 years ago
United States

I'm using DESeq2 to analyze count data from a fungal transcriptomics experiment. I've noticed that DESeq2 seems to handle contrasts of count data oddly if all individuals belonging to the factor levels being contrasted have 0 counts.

Below is some dummy code, with values taken from my data. There are two genes, eight individuals, two individuals per condition. DESeq2 suggests that gene1 is highly DE between conditions 2 and 3, even though counts of gene1 in the condition 2 ind's (i3 and i4) and condition 3 ind's (i5 and i6) are all 0. 

dummy.counts = data.frame(i1 = c(60,40),
                          i2 = c(80,28),
                          i3 = c(0,2),
                          i4 = c(0,19),
                          i5 = c(0,6342),
                          i6 = c(0,19182),
                          i7 = c(11435,7337),
                          i8 = c(14246,8667))
rownames(dummy.counts) = c("gene1","gene2")

dummy.meta = data.frame(ind = c('i1','i2','i3','i4','i5','i6','i7','i8'),
                        condition = c('c1','c1','c2','c2','c3','c3','c4','c4'))

dds0 = DESeqDataSetFromMatrix(
  countData = dummy.counts,
  colData = dummy.meta,
  design = ~ condition)

dds = DESeq(dds0)

results(dds, contrast = c('condition','c2','c3'))

log2 fold change (MAP): condition c2 vs c3 
Wald test p-value: condition c2 vs c3 
DataFrame with 2 rows and 6 columns
       baseMean log2FoldChange     lfcSE          stat       pvalue         padj
      <numeric>      <numeric> <numeric>     <numeric>    <numeric>    <numeric>
gene1  346.8049   7.527359e+00 1.4609970  5.152207e+00 2.574381e-07 5.148762e-07
gene2  367.0191  -2.706990e-07 0.4839797 -5.593190e-07 9.999996e-01 9.999996e-01

From playing around, it seems that this is caused by the fact that gene2 is highly expressed under condition 3 but not under condition 2. That is, if you change the gene2 expression to be more similar between the two conditions, DESeq2 no longer reports significant DE:

dummy.counts2 = data.frame(i1 = c(60,40),
                          i2 = c(80,28),
                          i3 = c(0,4000),
                          i4 = c(0,3800),
                          i5 = c(0,6342),
                          i6 = c(0,19182),
                          i7 = c(11435,7337),
                          i8 = c(14246,8667))
dds02 = DESeqDataSetFromMatrix(
  countData = dummy.counts2,
  colData = dummy.meta,
  design = ~ condition)

dds2 = DESeq(dds02)

results(dds2, contrast = c('condition','c2','c3'))

log2 fold change (MAP): condition c2 vs c3 
Wald test p-value: condition c2 vs c3 
DataFrame with 2 rows and 6 columns
   baseMean log2FoldChange     lfcSE          stat    pvalue      padj
  <numeric>      <numeric> <numeric>     <numeric> <numeric> <numeric>
1  1739.173   3.269788e-02 0.2180953  1.499247e-01  0.880824         1
2  1840.544  -2.886503e-10 0.2090734 -1.380617e-09  1.000000         1

It seems that in the first scenario, gene1 is "differentially expressed" between conditions 2 & 3 in the sense that it is 0 counts of 39 in condition 2 individuals and 0 counts out of 25,000 in condition 3 individuals. When it is 0 out of 8,000 vs. 0 out of 25,000, it is not reported as DE.

The above dummy results reflect what I've seen in our true dataset (~11k genes, 6 conditions, 16 usable libraries). Since some fungal isolates infect the sampled tissue better than others, our libraries have a wide range of counts. In the real dataset, the libraries in condition 2 have only 6k-80k counts total, while those in condition 3 have 75M to 110M. Thus, if gene1 is missing from both of these, it will be reported as highly DE.

What is the best way to avoid or mitigate this?

deseq2 rnaseq r • 1.6k views
ADD COMMENT
1
Entering edit mode
@mikelove
Last seen 1 day ago
United States

It appears, when the size factors have a dynamic range of >10,000 (which is extreme), the size-factor-adjusted expected count given a 0 for these two conditions is very different and this is reflected in a large coefficient.

For such a dataset, I would recommend subsetting to the two groups you want to compare, and running DESeq() on the subset.

Also, I wouldn't use a betaPrior on this dataset (so DESeq(dds, betaPrior=FALSE) ), as the log fold changes with such extreme size factors might not behave as expected when forced by the prior toward 0.

ADD COMMENT
1
Entering edit mode

Also I'm working in the devel branch on some logic to set these LFC to 0 (and therefore the Wald stat to 0).

ADD REPLY
0
Entering edit mode

Thanks for the guidance. The unfortunate difference in size factors is due to the fact that we're comparing fungal transcriptomes from infected plant tissue (minimal fungal biomass) to those from pure fungal cultures.

If I understand correctly, the expected count of gene i is higher when s_j is higher, meaning that the estimated q_ij is much lower for the samples with higher s_j (condition 3) than those with low s_j (condition 2). For two groups with 0 counts of gene i, the estimated negative log-FC (beta_ij), proportional to log(q_ij), is more extreme in the libraries with high s_j. Is that roughly on point?

ADD REPLY
1
Entering edit mode

Yes, you've got it, the beta prior bounds the LFC and its SE, so that it is finite instead of infinite, and then the large difference in s_j's forces a large difference in q_j's. It's a unique issue to the beta prior and the 0-0 comparison within a gene with counts in other groups and the large difference in s_j's is confounded with the comparison of interest. But still undesirable, so I've fixed this in version 1.7.32. Thanks for bringing it to my attention. The results() function will 0 out the LFC and Wald statistic, when the contrast is of two groups, both with 0 count.

ADD REPLY

Login before adding your answer.

Traffic: 842 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6