Question: DESeq2: removing levels from a factor or not having them in the analysis at all
3
4.8 years ago by
mat.lesche70
Germany
mat.lesche70 wrote:

Hello,

I'm working with a RNA-seq data set which contains four different conditions and each condition has 3 replicates. I did a little test and was now wondering what the best way to proceed. It's general questions on what is the best way to process the data.

In the past we removed genes with a low read count (let's say the bottom 25%) and then ran DESeq1. From the vignette in DESeq2 I know that DESeq2 filters out lowly expressed genes. I'm just wondering what is the better way now. Giving DESeq2 the whole data set, which also includes genes that have no read counts at all or already removing genes with low read counts?

The other question I have is about the inclusion of samples in the design or not using them.

The desing of the data set is the following:

> coldata
condition
A1         A
A2         A
A3         A
B1         B
B2         B
B3         B
C1         C
C2         C
C3         C
D1         D
D2         D
D3         D

I ran DESeq2 and extracted the result table for the comparison of A vs. B which showed 8471 differential expressed genes.

dds <- DESeqDataSetFromMatrix(countData = Jcounts, colData = coldata, design = ~ condition)
dds <- DESeq(dds)

result_full_ab = results(dds, contrast=c("condition", "A", "B"))

sum(result_full_ab$padj < 0.1, na.rm=T) #[1] 8471 I did the same comparison (A vs. B) by removing the factors "C" and "D" from the level which resulted in 7221 differential expressed genes. # remove the levels C and D dds_sub <- dds[, dds$condition %in% c("A", "B")]
dds_sub$condition <- droplevels(dds_sub$condition)
dds_sub <- DESeq(dds_sub)

result_sub_ab = results(dds_sub, contrast = c("condition", "A", "B"))
sum(result_sub_ab$padj < 0.1, na.rm=T) #[1] 7221 Finally I ran DESeq only with samples from "A" and "B" and I have 7123 genes dds_single_ab <- DESeqDataSetFromMatrix(countData = Jcounts_ab, colData = coldata_ab, design = ~ condition) dds_single_ab <- DESeq(dds_single_ab) result_single_ab <- results(dds_single_ab, contrast = c("condition", "A", "B")) sum(result_single_ab$padj < 0.1, na.rm=T)
#[1] 7123

I compared these three results with each other and the genes overlap between 90%-95%.

What is the best way to build the design for DESeq? Is it better to compare  A and B separately in a single DESeq2 run or have all samples in one DESeq2 run?

Furthermore I did a similar test with a multifactor design and also saw differences in the number of differential expressed genes after I removed leves from one factor and ran DESeq2 again.

Mathias

modified 4.8 years ago by Michael Love24k • written 4.8 years ago by mat.lesche70
Answer: DESeq2: removing levels from a factor or not having them in the analysis at all
1
4.8 years ago by
Michael Love24k
United States
Michael Love24k wrote:

Regarding which is better, an arbitrary threshold on minimal counts (lowest 25% of mean count) or the independent filtering in DESeq2, I would say the independent filtering, if by better you mean maximizing the number of rejections for the whole experiment. See the Bourgon paper for background:

http://www.pnas.org/content/107/21/9546.long

re: the inclusion of samples: If you expect that conditions C and D have generally similar within-group dispersion variance to A and B, then it is better to include all samples, because it helps improve the estimates of dispersion by increasing the sample size. But again, this is assuming that we are getting a better estimate of the dispersion for a given gene, to make inference on a B vs A comparison, by observing the variance of counts in two more groups. Note that it's not surprising that the set of genes with small FDR change slightly when you change the set of samples used for estimating dispersion: p-values are tail probabilities which are highly sensitive to model parameters.