DESeq: low p-values for genes with zero counts in all but one sample of the compared groups
1
0
Entering edit mode
@antonio-miguel-de-jesus-domingues-5182
Last seen 8 weeks ago
Germany

Hi all,

once in a while I come upon an DESeq analysis which results in a large number of genes with low p.adjust values (~significant genes) but with a somewhat unsatisfying expression pattern: zero counts in nearly all samples. It is very likely I am missing a modelling parameter but I did search for solution and didn't find anything.

For example, in this re-analysis of a published dataset there are two distinct clouds with extreme fold-changes but at the low end of expression:

plotMA(res, ylim=c(min(res$log2FoldChange, na.rm = T), max(res$log2FoldChange, na.rm = T)))

These contain quite a few differentially expressed genes, and when I take the one with the lowest padj we can see that it has zero counts in all of the control samples (scrambled_KCl)

plotCounts(dds, gene=which.min(res$padj), intgroup="siRNA_stimulus")

gene_id <- rownames(res[which.min(res$padj), ])
samples <- subset(sample_info, siRNA_stimulus %in% c("scrambled_KCl", "NEAT1_KCl")) %>%
    dplyr::arrange(siRNA_stimulus) %>%
    .$sample_id

cnts[gene_id, samples] %>%
    knitr::kable()

x
ERR841606 0
ERR841614 0
ERR841619 0
ERR841607 269
ERR841611 213
ERR841618 236

On the one had it makes sense that DESeq calls this gene as differentially expressed. However, if we take the 20 genes with the lowest 20 padj values we see the pattern repeating and something even more interesting:

res_20 <- res[order(res$padj), ] %>% head(20)

res_20 %>%
    knitr::kable()


cnts[rownames(res_20), samples]  %>%
    knitr::kable()
baseMean log2FoldChange lfcSE stat pvalue padj
ENSG00000064787 281.8787 21.46530 1.829856 11.730597 0 0
ENSG00000138136 301.6668 -24.53871 2.102424 -11.671629 0 0
ENSG00000261302 151.4447 -22.65529 2.098731 -10.794760 0 0
ENSG00000234913 181.8306 21.73508 2.021428 10.752341 0 0
ENSG00000276591 226.5125 21.68802 2.035174 10.656591 0 0
ENSG00000223843 221.6896 22.60131 2.181985 10.358141 0 0
ENSG00000235649 277.0373 21.39288 2.088070 10.245290 0 0
ENSG00000260036 170.4672 21.07013 2.068249 10.187428 0 0
ENSG00000261774 163.4805 21.30677 2.128112 10.012052 0 0
ENSG00000277304 232.6921 21.53662 2.159207 9.974319 0 0
ENSG00000268066 187.8949 21.43550 2.151105 9.964878 0 0
ENSG00000223431 208.2810 21.17024 2.135009 9.915763 0 0
ENSG00000062096 154.0686 21.51732 2.178597 9.876689 0 0
ENSG00000262248 128.2776 -23.73491 2.406768 -9.861733 0 0
ENSG00000259691 189.3708 20.87996 2.120870 9.844999 0 0
ENSG00000200997 311.3471 -22.31373 2.284490 -9.767488 0 0
ENSG00000213877 195.0722 21.81423 2.236702 9.752854 0 0
ENSG00000201821 110.8460 -24.38071 2.510122 -9.712957 0 0
ENSG00000276957 155.2442 -23.59726 2.452138 -9.623138 0 0
ENSG00000237260 162.8905 -24.11136 2.509733 -9.607141 0 0
ERR841606 ERR841614 ERR841619 ERR841607 ERR841611 ERR841618
ENSG00000064787 0 0 0 269 213 236
ENSG00000138136 0 672 0 0 0 0
ENSG00000261302 159 0 0 0 0 0
ENSG00000234913 0 0 0 530 224 124
ENSG00000276591 0 0 0 162 540 169
ENSG00000223843 0 0 0 432 675 463
ENSG00000235649 0 0 0 618 116 0
ENSG00000260036 0 0 0 88 136 273
ENSG00000261774 0 0 0 214 76 289
ENSG00000277304 0 0 0 319 198 235
ENSG00000268066 0 0 0 294 340 100
ENSG00000223431 0 0 0 249 217 129
ENSG00000062096 0 0 0 172 288 259
ENSG00000262248 112 92 107 0 0 0
ENSG00000259691 0 0 0 249 131 101
ENSG00000200997 0 123 0 0 0 0
ENSG00000213877 0 0 0 143 318 404
ENSG00000201821 157 86 271 0 0 0
ENSG00000276957 0 0 267 0 0 0
ENSG00000237260 457 0 0 0 0 0

Several genes have a single sample with counts > 0 for samples in those groups being compared. In this case one would assume that DESeq would not attribute a low p-value two those because there is little evidence for difference expression in those two groups. Showing the counts for all groups, we can see that this gene happens to be highly expressed in the samples of other groups:

samples_ordered <- sample_info %>%
    dplyr::arrange(siRNA_stimulus)

cnts["ENSG00000237260", samples_ordered$sample_id]

plotCounts(dds, gene="ENSG00000237260", intgroup="siRNA_stimulus")

My question, or doubt, is it expected that this gene would have a low p-value? If seems that while it sort of this, the model might consider that there is enough evidence in the the sample set as whole to attribute a low p-value, but IMO as a biologist, it doesn't pass the "eye test".

As a follow question, is there any way to model out these genes? In principle one could do pre-filtering for genes not expressed in less than x% samples of any condition, but for practical reasons I would rather not (merging results downstream), and I was under the assumption that DESeq would handle these cases elegantly.

Cheers!

DESeq2 DifferentialExpression • 695 views
0
Entering edit mode
@james-w-macdonald-5106
Last seen 26 minutes ago
United States

What practical reason do you have for not wanting to filter these genes? It is very common to filter genes that have zero counts in most samples because you can get spurious results like this.

ADD COMMENT
0
Entering edit mode

Quite often there is a need to merge datasets, for example compare log2 fold changes or intersect differential expressed genes between two or more separate analysis. Having different subsets of genes for each of those analysis, whilst not an insurmountable problem, is a bit of pain when merging missing values. This is avoidable the results tables have consistently the same genes*.

But regardless of these minor considerations, the important issue is that there is a general assumption (recommendation) that filtering genes is not necessary in DESeq2 because the model handles low counts etc very well (which it does!). However example above shows that there are caveats with this blanket recommendation.

  • I can also set the values and FC to NA in the results tables for these genes.

Login before adding your answer.

Traffic: 944 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6