I'm using DESeq2 to analyze count data from a fungal transcriptomics experiment. I've noticed that DESeq2 seems to handle contrasts of count data oddly if all individuals belonging to the factor levels being contrasted have 0 counts.

Below is some dummy code, with values taken from my data. There are two genes, eight individuals, two individuals per condition. DESeq2 suggests that gene1 is highly DE between conditions 2 and 3, even though counts of gene1 in the condition 2 ind's (i3 and i4) and condition 3 ind's (i5 and i6) are all 0.

dummy.counts = data.frame(i1 = c(60,40), i2 = c(80,28), i3 = c(0,2), i4 = c(0,19), i5 = c(0,6342), i6 = c(0,19182), i7 = c(11435,7337), i8 = c(14246,8667)) rownames(dummy.counts) = c("gene1","gene2") dummy.meta = data.frame(ind = c('i1','i2','i3','i4','i5','i6','i7','i8'), condition = c('c1','c1','c2','c2','c3','c3','c4','c4')) dds0 = DESeqDataSetFromMatrix( countData = dummy.counts, colData = dummy.meta, design = ~ condition) dds = DESeq(dds0) results(dds, contrast = c('condition','c2','c3')) log2 fold change (MAP): condition c2 vs c3 Wald test p-value: condition c2 vs c3 DataFrame with 2 rows and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> gene1 346.8049 7.527359e+00 1.4609970 5.152207e+00 2.574381e-07 5.148762e-07 gene2 367.0191 -2.706990e-07 0.4839797 -5.593190e-07 9.999996e-01 9.999996e-01

From playing around, it seems that this is caused by the fact that gene2 is highly expressed under condition 3 but not under condition 2. That is, if you change the gene2 expression to be more similar between the two conditions, DESeq2 no longer reports significant DE:

dummy.counts2 = data.frame(i1 = c(60,40), i2 = c(80,28), i3 = c(0,4000), i4 = c(0,3800), i5 = c(0,6342), i6 = c(0,19182), i7 = c(11435,7337), i8 = c(14246,8667)) dds02 = DESeqDataSetFromMatrix( countData = dummy.counts2, colData = dummy.meta, design = ~ condition) dds2 = DESeq(dds02) results(dds2, contrast = c('condition','c2','c3')) log2 fold change (MAP): condition c2 vs c3 Wald test p-value: condition c2 vs c3 DataFrame with 2 rows and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> 1 1739.173 3.269788e-02 0.2180953 1.499247e-01 0.880824 1 2 1840.544 -2.886503e-10 0.2090734 -1.380617e-09 1.000000 1

It seems that in the first scenario, gene1 is "differentially expressed" between conditions 2 & 3 in the sense that it is 0 counts of 39 in condition 2 individuals and 0 counts out of 25,000 in condition 3 individuals. When it is 0 out of 8,000 vs. 0 out of 25,000, it is not reported as DE.

The above dummy results reflect what I've seen in our true dataset (~11k genes, 6 conditions, 16 usable libraries). Since some fungal isolates infect the sampled tissue better than others, our libraries have a wide range of counts. In the real dataset, the libraries in condition 2 have only 6k-80k counts total, while those in condition 3 have 75M to 110M. Thus, if gene1 is missing from both of these, it will be reported as highly DE.

What is the best way to avoid or mitigate this?

Also I'm working in the devel branch on some logic to set these LFC to 0 (and therefore the Wald stat to 0).

Thanks for the guidance. The unfortunate difference in size factors is due to the fact that we're comparing fungal transcriptomes from infected plant tissue (minimal fungal biomass) to those from pure fungal cultures.

If I understand correctly, the expected count of gene i is higher when s_j is higher, meaning that the estimated q_ij is much lower for the samples with higher s_j (condition 3) than those with low s_j (condition 2). For two groups with 0 counts of gene i, the estimated negative log-FC (beta_ij), proportional to log(q_ij), is more extreme in the libraries with high s_j. Is that roughly on point?

Yes, you've got it, the beta prior bounds the LFC and its SE, so that it is finite instead of infinite, and then the large difference in s_j's forces a large difference in q_j's. It's a unique issue to the beta prior and the 0-0 comparison within a gene with counts in other groups and the large difference in s_j's is confounded with the comparison of interest. But still undesirable, so I've fixed this in version 1.7.32. Thanks for bringing it to my attention. The results() function will 0 out the LFC and Wald statistic, when the contrast is of two groups, both with 0 count.