Hello,

I'm working with a RNA-seq data set which contains four different conditions and each condition has 3 replicates. I did a little test and was now wondering what the best way to proceed. It's general questions on what is the best way to process the data.

In the past we removed genes with a low read count (let's say the bottom 25%) and then ran DESeq1. From the vignette in DESeq2 I know that DESeq2 filters out lowly expressed genes. I'm just wondering what is the better way now. Giving DESeq2 the whole data set, which also includes genes that have no read counts at all or already removing genes with low read counts?

The other question I have is about the inclusion of samples in the design or not using them.

The desing of the data set is the following:

> coldata

condition

A1 A

A2 A

A3 A

B1 B

B2 B

B3 B

C1 C

C2 C

C3 C

D1 D

D2 D

D3 D

I ran DESeq2 and extracted the result table for the comparison of A vs. B which showed 8471 differential expressed genes.

dds <- DESeqDataSetFromMatrix(countData = Jcounts, colData = coldata, design = ~ condition)

dds <- DESeq(dds)

result_full_ab = results(dds, contrast=c("condition", "A", "B"))

sum(result_full_ab$padj < 0.1, na.rm=T)

#[1] 8471

I did the same comparison (A vs. B) by removing the factors "C" and "D" from the level which resulted in 7221 differential expressed genes.

# remove the levels C and D

dds_sub <- dds[, dds$condition %in% c("A", "B")]

dds_sub$condition <- droplevels(dds_sub$condition)

dds_sub <- DESeq(dds_sub)

result_sub_ab = results(dds_sub, contrast = c("condition", "A", "B"))

sum(result_sub_ab$padj < 0.1, na.rm=T)

#[1] 7221

Finally I ran DESeq only with samples from "A" and "B" and I have 7123 genes

dds_single_ab <- DESeqDataSetFromMatrix(countData = Jcounts_ab, colData = coldata_ab, design = ~ condition)

dds_single_ab <- DESeq(dds_single_ab)

result_single_ab <- results(dds_single_ab, contrast = c("condition", "A", "B"))

sum(result_single_ab$padj < 0.1, na.rm=T)

#[1] 7123

I compared these three results with each other and the genes overlap between 90%-95%.

What is the best way to build the design for DESeq? Is it better to compare A and B separately in a single DESeq2 run or have all samples in one DESeq2 run?

Furthermore I did a similar test with a multifactor design and also saw differences in the number of differential expressed genes after I removed leves from one factor and ran DESeq2 again.

Thanks for your help,

Mathias

Maybe have a look at a PCA plot (see vignette) to see whether the within-group variance differs between groups.