Search
Question: DESeq2: removing levels from a factor or not having them in the analysis at all
0
gravatar for mat.lesche
3.5 years ago by
mat.lesche20
Germany
mat.lesche20 wrote:

Hello,

I'm working with a RNA-seq data set which contains four different conditions and each condition has 3 replicates. I did a little test and was now wondering what the best way to proceed. It's general questions on what is the best way to process the data.

In the past we removed genes with a low read count (let's say the bottom 25%) and then ran DESeq1. From the vignette in DESeq2 I know that DESeq2 filters out lowly expressed genes. I'm just wondering what is the better way now. Giving DESeq2 the whole data set, which also includes genes that have no read counts at all or already removing genes with low read counts?

 

The other question I have is about the inclusion of samples in the design or not using them.

The desing of the data set is the following:

> coldata
             condition
A1         A
A2         A
A3         A
B1         B
B2         B
B3         B
C1         C
C2         C
C3         C
D1         D
D2         D
D3         D

I ran DESeq2 and extracted the result table for the comparison of A vs. B which showed 8471 differential expressed genes.

dds <- DESeqDataSetFromMatrix(countData = Jcounts, colData = coldata, design = ~ condition)
dds <- DESeq(dds)

result_full_ab = results(dds, contrast=c("condition", "A", "B"))

sum(result_full_ab$padj < 0.1, na.rm=T)
#[1] 8471

I did the same comparison (A vs. B) by removing the factors "C" and "D" from the level which resulted in 7221 differential expressed genes.

# remove the levels C and D
dds_sub <- dds[, dds$condition %in% c("A", "B")]
dds_sub$condition <- droplevels(dds_sub$condition)
dds_sub <- DESeq(dds_sub)

result_sub_ab = results(dds_sub, contrast = c("condition", "A", "B"))
sum(result_sub_ab$padj < 0.1, na.rm=T)
#[1] 7221

Finally I ran DESeq only with samples from "A" and "B" and I have 7123 genes

dds_single_ab <- DESeqDataSetFromMatrix(countData = Jcounts_ab, colData = coldata_ab, design = ~ condition)
dds_single_ab <- DESeq(dds_single_ab)

result_single_ab <- results(dds_single_ab, contrast = c("condition", "A", "B"))
sum(result_single_ab$padj < 0.1, na.rm=T)
#[1] 7123

I compared these three results with each other and the genes overlap between 90%-95%.

What is the best way to build the design for DESeq? Is it better to compare  A and B separately in a single DESeq2 run or have all samples in one DESeq2 run?

Furthermore I did a similar test with a multifactor design and also saw differences in the number of differential expressed genes after I removed leves from one factor and ran DESeq2 again.

 

Thanks for your help,

Mathias

ADD COMMENTlink modified 3.5 years ago by Michael Love17k • written 3.5 years ago by mat.lesche20
0
gravatar for Michael Love
3.5 years ago by
Michael Love17k
United States
Michael Love17k wrote:

Regarding which is better, an arbitrary threshold on minimal counts (lowest 25% of mean count) or the independent filtering in DESeq2, I would say the independent filtering, if by better you mean maximizing the number of rejections for the whole experiment. See the Bourgon paper for background: 

http://www.pnas.org/content/107/21/9546.long

re: the inclusion of samples: If you expect that conditions C and D have generally similar within-group dispersion variance to A and B, then it is better to include all samples, because it helps improve the estimates of dispersion by increasing the sample size. But again, this is assuming that we are getting a better estimate of the dispersion for a given gene, to make inference on a B vs A comparison, by observing the variance of counts in two more groups. Note that it's not surprising that the set of genes with small FDR change slightly when you change the set of samples used for estimating dispersion: p-values are tail probabilities which are highly sensitive to model parameters.

ADD COMMENTlink modified 3.5 years ago • written 3.5 years ago by Michael Love17k

Maybe have a look at a PCA plot (see vignette) to see whether the within-group variance differs between groups.

ADD REPLYlink written 3.5 years ago by Simon Anders3.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 190 users visited in the last hour