Difference between DESeqDataSetFromMatrix() function and DESeq() function
Abir.khazaal ▴ 10

Hi, I am currently performing differential expression analysis using DESeq2.

I want to filter out lowly expressed genes, although I read in another post here that this may not be necessary because the independentFiltering argument of results() already does something similar. However, I am comparing different approaches to differential expression analysis, so I need to apply the same criteria in each.

What I want to know is the difference between the two approaches below.

I have seen some people filter before calling the DESeq() function:

dds <- DESeqDataSetFromMatrix(countData = countData,
                              colData = metaData,
                              design = ~ condition) 

keep <- rowSums(counts(dds) >= 10) >= 10
dds <- dds[keep,]

dds <- DESeq(dds)
normalizedCounts <- counts(dds, normalized=TRUE)

The developer, on the other hand, called DESeq() first and then filtered:

dds <- DESeqDataSetFromMatrix(countData = countData,
                              colData = metaData,
                              design = ~ condition) 
dds <- DESeq(dds)
dds <- estimateSizeFactors(dds)

# Apply the filtering criteria
idx <- rowSums(counts(dds, normalized=TRUE) >= 10) >= 10
dds <- dds[idx,]

dds <- DESeq(dds)

So I just want to understand which approach is the right one and why :)


DESeq DESeq2 • 260 views
ATpoint ★ 3.4k

Please follow the manual.

estimateSizeFactors() is part of DESeq(), so skip that. Filter on raw, not normalized, counts; see the vignette.
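To make the raw-versus-normalized distinction concrete, here is a minimal base R sketch on a made-up count matrix, so it runs without DESeq2 installed. The matrix, the group size of 3, and the hand-rolled median-of-ratios calculation are all assumptions for illustration; the calculation only approximates what estimateSizeFactors() does and is not DESeq2's own code.

```r
# Toy raw-count matrix: 4 genes x 6 samples (made-up numbers).
# Samples s4-s6 are sequenced roughly twice as deep as s1-s3.
counts <- matrix(
  c(120, 130, 110, 240, 260, 250,   # clearly expressed
     12,  11,  10,  25,  22,  24,   # low, but >= 10 in every sample
      9,   8,   9,  19,  18,  20,   # >= 10 only in the deeper samples
      0,   1,   0,   2,   0,   1),  # essentially unexpressed
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("s", 1:6))
)

# Assumption: the smallest experimental group has 3 samples.
smallest_group_size <- 3

# Pre-filter on RAW counts: at least 10 reads in at least 3 samples.
keep_raw <- rowSums(counts >= 10) >= smallest_group_size

# Hand-rolled median-of-ratios size factors (illustrative approximation).
# Genes containing a zero get a log geometric mean of -Inf and are skipped.
log_geo_mean <- rowMeans(log(counts))
usable <- is.finite(log_geo_mean)
size_factors <- apply(counts, 2, function(x) {
  exp(median(log(x[usable]) - log_geo_mean[usable]))
})

# Normalized counts are just raw counts divided by a per-sample constant.
normalized <- sweep(counts, 2, size_factors, "/")
keep_norm <- rowSums(normalized >= 10) >= smallest_group_size

print(keep_raw)
print(round(size_factors, 2))
```

With a real DESeqDataSet, the raw-count filter would be written on counts(dds) and followed by a single DESeq() call, which estimates the size factors itself.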


Thank you for that @atpoint.

I have seen the steps above in the vignette, but I got confused when I saw a thread ("deseq2 filter the low counts") where the developer performed prefiltering using estimateSizeFactors().

One question: I didn't quite understand your last sentence. Why should I filter on raw data? Wouldn't that ignore the differences in library size and sequencing depth? When I performed DE analysis with edgeR, I pre-filtered on CPM values. I have added my edgeR pre-filtering code below.

Your help with this is highly appreciated


# Load edgeR, which provides DGEList(), cpm() and estimateDisp()
library(edgeR)

# Prepare raw counts as a DGEList object
dge <- DGEList(counts = countData)

# Obtain CPM values using cpm
cpm_values <- cpm(dge)

# Keep genes with a CPM above 10 in at least 10 samples
keep <- rowSums(cpm_values > 10) >= 10

# Subset DGEList object to keep only selected genes
dge <- dge[keep, , keep.lib.sizes=FALSE] 

# create a design matrix
design <- model.matrix(~0 +AGE, data=metaData) 

# Estimate common and tagwise dispersions
dge <- estimateDisp(dge, design)

#fit linear model .. etc.

My advice is to always follow the manual unless you have the expert knowledge to do something else. The linked thread is 8 years old, and recommendations from developers change over time. The edgeR manual doesn't recommend filtering on CPMs; it uses filterByExpr(). It is up to you whether to follow the best practices in the manuals or do something custom. Please see the manuals of both edgeR and DESeq2; they contain code suggestions for prefiltering.
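As a rough illustration of why a fixed CPM cutoff and a count-based rule are not interchangeable, the base R sketch below converts between the two for some made-up library sizes. The library sizes and variable names are hypothetical. The second half reflects my understanding of the general idea behind count-based filters such as filterByExpr() (start from a minimum count and scale it by a typical library size); it is not the edgeR implementation.

```r
# Toy library sizes (total reads per sample); made-up numbers.
lib_sizes <- c(s1 = 5e6, s2 = 10e6, s3 = 20e6, s4 = 40e6)

# A fixed cutoff of 10 CPM demands a different raw count in each library.
cpm_cutoff <- 10
raw_at_cutoff <- cpm_cutoff * lib_sizes / 1e6
print(raw_at_cutoff)

# Going the other way: start from a minimum raw count of 10 and express it
# as a CPM cutoff relative to the median library size.
min_count <- 10
derived_cpm_cutoff <- min_count / median(lib_sizes) * 1e6
print(derived_cpm_cutoff)
```

In this toy setup, 10 CPM corresponds to between 50 and 400 raw reads depending on the sample, whereas a minimum count of 10 corresponds to a CPM cutoff well below 1, so a "CPM > 10" rule is far stricter than a "10 reads" rule.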

