Question

DESeq2: removing levels from a factor or not having them in the analysis at all

3

Entering edit mode

mat.lesche ▴ 90

@matlesche-6835

Last seen 24 months ago

Germany

Hello,

I'm working with a RNA-seq data set which contains four different conditions and each condition has 3 replicates. I did a little test and was now wondering what the best way to proceed. It's general questions on what is the best way to process the data.

In the past we removed genes with a low read count (let's say the bottom 25%) and then ran DESeq1. From the vignette in DESeq2 I know that DESeq2 filters out lowly expressed genes. I'm just wondering what is the better way now. Giving DESeq2 the whole data set, which also includes genes that have no read counts at all or already removing genes with low read counts?

The other question I have is about the inclusion of samples in the design or not using them.

The desing of the data set is the following:

> coldata
             condition
A1         A
A2         A
A3         A
B1         B
B2         B
B3         B
C1         C
C2         C
C3         C
D1         D
D2         D
D3         D

I ran DESeq2 and extracted the result table for the comparison of A vs. B which showed 8471 differential expressed genes.

dds <- DESeqDataSetFromMatrix(countData = Jcounts, colData = coldata, design = ~ condition)
dds <- DESeq(dds)

result_full_ab = results(dds, contrast=c("condition", "A", "B"))

sum(result_full_ab$padj < 0.1, na.rm=T)
#[1] 8471

I did the same comparison (A vs. B) by removing the factors "C" and "D" from the level which resulted in 7221 differential expressed genes.

# remove the levels C and D
dds_sub <- dds[, dds$condition %in% c("A", "B")]
dds_sub$condition <- droplevels(dds_sub$condition)
dds_sub <- DESeq(dds_sub)

result_sub_ab = results(dds_sub, contrast = c("condition", "A", "B"))
sum(result_sub_ab$padj < 0.1, na.rm=T)
#[1] 7221

Finally I ran DESeq only with samples from "A" and "B" and I have 7123 genes

dds_single_ab <- DESeqDataSetFromMatrix(countData = Jcounts_ab, colData = coldata_ab, design = ~ condition)
dds_single_ab <- DESeq(dds_single_ab)

result_single_ab <- results(dds_single_ab, contrast = c("condition", "A", "B"))
sum(result_single_ab$padj < 0.1, na.rm=T)
#[1] 7123

I compared these three results with each other and the genes overlap between 90%-95%.

What is the best way to build the design for DESeq? Is it better to compare A and B separately in a single DESeq2 run or have all samples in one DESeq2 run?

Furthermore I did a similar test with a multifactor design and also saw differences in the number of differential expressed genes after I removed leves from one factor and ran DESeq2 again.

Thanks for your help,

Mathias

deseq2 differentialexpression • 6.7k views

ADD COMMENT • link updated 9.6 years ago by Michael Love 41k • written 9.6 years ago by mat.lesche ▴ 90

score 1 · Answer 1 · 2014-10-08

Regarding which is better, an arbitrary threshold on minimal counts (lowest 25% of mean count) or the independent filtering in DESeq2, I would say the independent filtering, if by better you mean maximizing the number of rejections for the whole experiment. See the Bourgon paper for background:

http://www.pnas.org/content/107/21/9546.long

re: the inclusion of samples: If you expect that conditions C and D have generally similar within-group dispersion ~~variance~~ to A and B, then it is better to include all samples, because it helps improve the estimates of dispersion by increasing the sample size. But again, this is assuming that we are getting a better estimate of the dispersion for a given gene, to make inference on a B vs A comparison, by observing the variance of counts in two more groups. Note that it's not surprising that the set of genes with small FDR change slightly when you change the set of samples used for estimating dispersion: p-values are tail probabilities which are highly sensitive to model parameters.