Obtaining many more significant differentially expressed genes by subsetting than just using contrasts
Tom ▴ 10
United Kingdom


I'm using DESeq2 to perform differential expression analysis comparing samples from 2 different cell types in 2 different cell lines (n=12), with the samples sequenced in two batches. This gives two comparisons:

A-pos vs A-neg

B-pos vs B-neg

There are 6 samples for each of A and B, with 3 pos and 3 neg in each.

I am either running all the samples in one dds with the design ~ replicate + group, or subsetting the object using:

dds_subset <- dds[, dds$group == "A_pos" | dds$group == "A_neg"]
dds_subset$group <- droplevels(dds_subset$group)
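The subsetting step can be sketched as a small helper. This is a minimal sketch, assuming DESeq() is re-run on the subset so that dispersions are re-estimated from the A samples only (the differing p-values suggest this is what happens). The helper name `refit_on_subset` and its arguments are hypothetical; `DESeq()` is the standard DESeq2 fitting function.

```r
# Hypothetical helper: keep only the samples in `groups`, drop the unused
# factor levels, and re-fit the model on the reduced object.
# Assumes `dds` is a DESeqDataSet with a factor column `group` in its colData.
refit_on_subset <- function(dds, groups) {
  sub <- dds[, dds$group %in% groups]
  sub$group <- droplevels(sub$group)
  # Re-running DESeq() re-estimates dispersions using only the kept samples,
  # which is why the standard errors and p-values can change.
  DESeq2::DESeq(sub)
}

# Usage (not run here; requires a fitted `dds`):
# dds_subset <- refit_on_subset(dds, c("A_pos", "A_neg"))
```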

And then obtaining results with:

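# No subsetting: contrast taken from the model fitted on all 12 samples
A_res <- results(dds, contrast = c("group", "A_pos", "A_neg"))

# Subsetting: same contrast from the model fitted on the A samples only
A_subset_res <- results(dds_subset, contrast = c("group", "A_pos", "A_neg"))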

Using this I obtain 422 DE genes without subsetting and 1201 DE genes with subsetting. The log2 fold changes are similar across the two methods, but the p-values and adjusted p-values (padj) are mostly very different.
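One way to put numbers on that comparison is sketched below. It is only an illustration: the function name `compare_results`, the toy data frames, and the padj cutoff of 0.1 are assumptions (the cutoff used for the counts above is not stated). It relies only on the `log2FoldChange`, `lfcSE` and `padj` columns that results() returns.

```r
# Hypothetical helper: summarise how two results tables for the same
# contrast differ. Both inputs need matching row names (genes).
compare_results <- function(full, sub, alpha = 0.1) {
  full <- as.data.frame(full)
  sub <- as.data.frame(sub)
  stopifnot(identical(rownames(full), rownames(sub)))

  # padj can be NA (e.g. after independent filtering); treat NA as not significant
  sig_full <- !is.na(full$padj) & full$padj < alpha
  sig_sub <- !is.na(sub$padj) & sub$padj < alpha

  list(
    n_sig = c(full = sum(sig_full), subset = sum(sig_sub),
              both = sum(sig_full & sig_sub)),
    # Agreement of effect sizes between the two fits
    lfc_cor = cor(full$log2FoldChange, sub$log2FoldChange,
                  use = "complete.obs"),
    # Values below 1 mean smaller standard errors in the subset fit
    median_se_ratio = median(sub$lfcSE / full$lfcSE, na.rm = TRUE)
  )
}

# Toy tables standing in for A_res and A_subset_res (made-up values)
toy_full <- data.frame(log2FoldChange = c(1.0, -2.0, 0.5),
                       lfcSE = c(0.5, 0.4, 0.5),
                       padj = c(0.20, 0.01, NA),
                       row.names = c("g1", "g2", "g3"))
toy_sub <- data.frame(log2FoldChange = c(1.1, -1.9, 0.4),
                      lfcSE = c(0.25, 0.2, 0.5),
                      padj = c(0.03, 0.001, 0.6),
                      row.names = c("g1", "g2", "g3"))
out <- compare_results(toy_full, toy_sub)
print(out)
```

With the real objects the call would be `compare_results(A_res, A_subset_res)`. A median standard-error ratio well below 1 alongside a high log2 fold change correlation would match the pattern described above: similar effect sizes, different p-values.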

So I have 3 questions:

  1. Am I using the design and contrasts correctly?
  2. Is the difference in significantly expressed genes across the two approaches expected?
    • If so why does this occur?
  3. And which method (subsetting or not) is a more accurate representation of differential expression?

Thank you, Tom

DESeq2 • 195 views
United States

This is covered by one of the FAQs in the DESeq2 vignette.


To my understanding, the section on multiple groups in the FAQ does not entirely explain why so many more DE genes are found with subsetting than without. In my dataset the within-group variability appears fairly similar across the datasets, so I wouldn't expect subsetting to increase sensitivity enough to give ~3x more DE genes.

Would finding more DE genes suggest that subsetting the datasets is the correct approach, or do you always get more DE genes when subsetting?


There isn't a single correct approach per se. It's a trade-off: using all samples gives more statistically stable estimates, but heterogeneous data call for per-group variance estimates, because otherwise sharing information across groups biases the estimates.

A big change in the number of DE genes is not unexpected: the variance estimates changed, and the test statistic is a ratio with the variance in the denominator. The groups may not look very different on the PCA plot, but for a number of genes the estimated variance is much lower when you subset to just the A samples.
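To make the ratio argument concrete, here is a small base R illustration with made-up numbers. It assumes the default Wald test, where the statistic is the log2 fold change divided by its standard error and is compared to a standard normal. The function name `wald_p` and the values are purely illustrative, not taken from the data in this thread.

```r
# Two-sided p-value for a Wald statistic W = lfc / se under a standard normal
wald_p <- function(lfc, se) {
  2 * pnorm(abs(lfc / se), lower.tail = FALSE)
}

lfc <- 1                    # same estimated log2 fold change in both fits
se_full <- 0.5              # illustrative SE from the all-samples fit
se_sub <- 0.25              # illustrative SE when the variance estimate is lower

p_full <- wald_p(lfc, se_full)   # W = 2
p_sub <- wald_p(lfc, se_sub)     # W = 4

print(c(full = p_full, subset = p_sub))
```

Halving the standard error doubles the statistic, so the p-value drops from roughly 0.05 to well below 0.001 even though the fold change is unchanged. Many genes sitting near the significance threshold can cross it this way.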

