DESeq2: removing levels from a factor or not having them in the analysis at all
Entering edit mode
mat.lesche ▴ 80
Last seen 22 days ago


I'm working with a RNA-seq data set which contains four different conditions and each condition has 3 replicates. I did a little test and was now wondering what the best way to proceed. It's general questions on what is the best way to process the data.

In the past we removed genes with a low read count (let's say the bottom 25%) and then ran DESeq1. From the vignette in DESeq2 I know that DESeq2 filters out lowly expressed genes. I'm just wondering what is the better way now. Giving DESeq2 the whole data set, which also includes genes that have no read counts at all or already removing genes with low read counts?


The other question I have is about the inclusion of samples in the design or not using them.

The desing of the data set is the following:

> coldata
A1         A
A2         A
A3         A
B1         B
B2         B
B3         B
C1         C
C2         C
C3         C
D1         D
D2         D
D3         D

I ran DESeq2 and extracted the result table for the comparison of A vs. B which showed 8471 differential expressed genes.

dds <- DESeqDataSetFromMatrix(countData = Jcounts, colData = coldata, design = ~ condition)
dds <- DESeq(dds)

result_full_ab = results(dds, contrast=c("condition", "A", "B"))

sum(result_full_ab$padj < 0.1, na.rm=T)
#[1] 8471

I did the same comparison (A vs. B) by removing the factors "C" and "D" from the level which resulted in 7221 differential expressed genes.

# remove the levels C and D
dds_sub <- dds[, dds$condition %in% c("A", "B")]
dds_sub$condition <- droplevels(dds_sub$condition)
dds_sub <- DESeq(dds_sub)

result_sub_ab = results(dds_sub, contrast = c("condition", "A", "B"))
sum(result_sub_ab$padj < 0.1, na.rm=T)
#[1] 7221

Finally I ran DESeq only with samples from "A" and "B" and I have 7123 genes

dds_single_ab <- DESeqDataSetFromMatrix(countData = Jcounts_ab, colData = coldata_ab, design = ~ condition)
dds_single_ab <- DESeq(dds_single_ab)

result_single_ab <- results(dds_single_ab, contrast = c("condition", "A", "B"))
sum(result_single_ab$padj < 0.1, na.rm=T)
#[1] 7123

I compared these three results with each other and the genes overlap between 90%-95%.

What is the best way to build the design for DESeq? Is it better to compare  A and B separately in a single DESeq2 run or have all samples in one DESeq2 run?

Furthermore I did a similar test with a multifactor design and also saw differences in the number of differential expressed genes after I removed leves from one factor and ran DESeq2 again.


Thanks for your help,


deseq2 differentialexpression • 5.6k views
Entering edit mode
Last seen 1 day ago
United States

Regarding which is better, an arbitrary threshold on minimal counts (lowest 25% of mean count) or the independent filtering in DESeq2, I would say the independent filtering, if by better you mean maximizing the number of rejections for the whole experiment. See the Bourgon paper for background:

re: the inclusion of samples: If you expect that conditions C and D have generally similar within-group dispersion variance to A and B, then it is better to include all samples, because it helps improve the estimates of dispersion by increasing the sample size. But again, this is assuming that we are getting a better estimate of the dispersion for a given gene, to make inference on a B vs A comparison, by observing the variance of counts in two more groups. Note that it's not surprising that the set of genes with small FDR change slightly when you change the set of samples used for estimating dispersion: p-values are tail probabilities which are highly sensitive to model parameters.

Entering edit mode

Maybe have a look at a PCA plot (see vignette) to see whether the within-group variance differs between groups.


Login before adding your answer.

Traffic: 178 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6