Question: deseq2 filter the low counts
4.7 years ago, aristotele_m wrote:

Dear all,

How can I filter out genes with low counts in DESeq2? Any suggestions on how to do this?



filter <- rowSums(count) > 10

thanks so much!!

modified 3.1 years ago by pkachroo • written 4.7 years ago by aristotele_m
Answer: deseq2 filter the low counts
4.7 years ago, Michael Love (United States) wrote:


If you want to filter, you can do so before running DESeq:

dds <- estimateSizeFactors(dds)
idx <- rowSums( counts(dds, normalized=TRUE) >= 5 ) >= 3

This says, for example: filter out genes that have fewer than 3 samples with normalized counts greater than or equal to 5.


dds <- dds[idx,]
dds <- DESeq(dds)
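To make the filter rule concrete, here is a small base-R sketch on a made-up matrix of normalized counts (the gene and sample names and all values are hypothetical, and no DESeq2 object is involved). It applies the same "at least 3 samples with a count of 5 or more" rule as above:

```r
# Hypothetical normalized counts: 4 genes x 6 samples
norm_counts <- matrix(
  c( 0,  1,  0,  2,  0,  1,
     8,  6,  7,  0,  1,  0,
     5,  4,  9,  3,  2,  2,
    50, 40, 60, 55, 45, 52),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:6))
)

# For each gene, count the samples whose normalized count is >= 5,
# then keep the genes where at least 3 samples pass
idx <- rowSums(norm_counts >= 5) >= 3

# Subset the rows, as dds[idx,] does for a DESeqDataSet
filtered <- norm_counts[idx, , drop = FALSE]
rownames(filtered)

# For comparison: a filter on the total count per gene
sum_filter <- rowSums(norm_counts) > 10
```

In this toy example gene3 passes the total-count filter (its row sum is 25) but fails the per-sample rule, because only two of its samples reach a count of 5.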

However, you typically don't need to pre-filter, because independent filtering occurs within results() to save you from multiple testing correction on genes with no power (see ?results and the vignette section about independent filtering, or the paper). The main reason to pre-filter would be to increase speed. Designs with many samples and many interaction terms can slow down on genes which have very few reads.
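The idea behind independent filtering can be sketched in base R with made-up numbers. This is only an illustration under simplifying assumptions: the p-values and mean counts are hypothetical, and the threshold of 5 is fixed by hand, whereas results() chooses its own threshold on the mean of normalized counts.

```r
# Hypothetical p-values and mean normalized counts for 8 genes
pvalue    <- c(0.001, 0.004, 0.03, 0.5, 0.7, 0.9, 0.8, 0.6)
base_mean <- c(500, 200, 80, 1, 0.5, 2, 0.2, 1.5)

# Benjamini-Hochberg adjustment over all 8 genes
padj_all <- p.adjust(pvalue, method = "BH")

# Same adjustment after setting aside genes with a low mean count;
# the filtered-out genes get NA for the adjusted p-value
keep <- base_mean >= 5   # arbitrary threshold for this sketch
padj_filtered <- rep(NA_real_, length(pvalue))
padj_filtered[keep] <- p.adjust(pvalue[keep], method = "BH")

sum(padj_all < 0.05)                      # 2
sum(padj_filtered < 0.05, na.rm = TRUE)   # 3
```

The third gene has an adjusted p-value of 0.08 when all 8 genes are corrected together, and 0.03 once the five low-count genes no longer add to the multiple testing burden.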

modified 3.5 years ago • written 4.7 years ago by Michael Love

Does this mean that DESeq2 has no problem estimating dispersions for lowly expressed features, as opposed to voom?

written 4.0 years ago by Nik Tuzov
In our DESeq2 paper we discuss a case where estimation of dispersion is difficult for genes with very, very low average counts. See the methods. However, it doesn't really affect the outcome because these genes have almost no power for detecting differential expression.
modified 3.5 years ago • written 4.0 years ago by Michael Love

Sorry, I have run into a point of confusion; please correct me if I am wrong. I noticed that in a differential expression analysis with DESeq2, the read count distribution of the differentially expressed genes is skewed towards more highly expressed genes. In other words, DESeq2 seems to have a threshold for ignoring very lowly expressed genes before the differential expression analysis. I had expected genes with very low read counts or zeros to be the ones driving differential expression, but a box plot shows that the DE genes are among the genes with higher read counts.

written 17 months ago by Fereshteh

More highly expressed genes have higher power for detecting DE.

And yes we do have an internal filter that optimizes this.

Like all important aspects of the method it is discussed in the paper and in the vignette.

written 17 months ago by Michael Love

Thank you, you are right.

written 17 months ago by Fereshteh

Sorry, by using these lines

dds <- DESeq(dds, minReplicatesForReplace=Inf)
res <- results(dds, cooksCutoff=FALSE, independentFiltering=FALSE)

will I prevent DESeq2's internal filtering from removing any genes from the differential expression analysis? Am I right?

written 17 months ago by Fereshteh

Yes, this will turn off independent filtering (on the mean of counts), as well as outlier replacement and outlier-based gene filtering.

written 17 months ago by Michael Love

Hi Michael, thanks for your posts - they are really helpful! I was wondering though, isn't there any issue with using `estimateSizeFactors(dds)` twice? Because the DESeq function is going to do this again, no?


written 11 months ago by rodrigo.duarte88

DESeq() does not re-estimate size factors when they are already present. It also prints a message saying so when you run it.

written 11 months ago by Michael Love
Answer: deseq2 filter the low counts
3.1 years ago, pkachroo wrote:

I have a similar question. In an experiment with 5 strains in triplicate, I have a gene with the following normalized counts:


Strain-1: 0,0,0

Strain-2: 0,0,0

Strain-3: 1.6,1.3,0

Strain-4: 0,0, 2.6

Strain-5: 105,102,101

After running DESeq2, this gene is flagged and given "NA" for the p-value and adjusted p-value, which makes sense. However, when I rerun the analysis with only the first two replicates per strain (the first two values listed for each strain above) and compare strain 5 with strain 4, this gene comes up as differentially expressed: baseMean=9.5 and log2FoldChange=3.3. Why is this gene not being flagged? And more importantly, how is DESeq2 able to compute a fold change when the normalized counts for this gene in strain 4 are all zero?

Appreciate your help.

Priyanka Kachroo


written 3.1 years ago by pkachroo

The question about calculating fold changes when strain 4 has zeros has been answered on the site a few times, but it's difficult to find the post. The short answer is that the DESeq2 statistical model (see paper) uses a prior distribution on the fold changes, and returns posterior estimates. So the posterior is a balance of the likelihood (which would give an infinite fold change) and the prior, which is calculated based on the range of fold changes from the most DE genes.
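A rough base-R sketch of that balance, under loud assumptions: it uses a Poisson likelihood instead of DESeq2's negative binomial model, takes the strain 4 and strain 5 values from the question as if they were raw counts, and uses a hand-picked normal prior with standard deviation 1 on the natural-log fold change (DESeq2 estimates its prior width from the data). It is not meant to reproduce the reported log2FoldChange of 3.3, only to show why the estimate is finite.

```r
# Values for one gene: first two replicates of strain 4 and strain 5
y_a <- c(0, 0)       # strain 4
y_b <- c(105, 102)   # strain 5

# Poisson log-likelihood as a function of the natural-log fold change beta,
# with the baseline mean set to its maximum-likelihood value for that beta
log_lik <- function(beta) {
  mu_a <- sum(y_a, y_b) / (length(y_a) + length(y_b) * exp(beta))
  mu_b <- mu_a * exp(beta)
  sum(dpois(y_a, mu_a, log = TRUE)) + sum(dpois(y_b, mu_b, log = TRUE))
}

# Log-posterior: likelihood plus a zero-centered normal prior on beta
log_post <- function(beta, prior_sd) {
  log_lik(beta) + dnorm(beta, mean = 0, sd = prior_sd, log = TRUE)
}

# The likelihood alone keeps increasing with beta, so it has no finite maximum
lik_keeps_rising <- log_lik(15) > log_lik(10) && log_lik(10) > log_lik(5)

# With the prior, the maximum is finite
map_beta   <- optimize(log_post, interval = c(-10, 20),
                       maximum = TRUE, prior_sd = 1)$maximum
map_log2fc <- map_beta / log(2)
```

A wider prior (larger prior_sd) lets the estimate move further towards the infinite maximum-likelihood value, and a narrower one pulls it closer to zero.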

Regarding the NA's:

If you read in the help page ?results about NA values in the pvalue column:

By default, results assigns a p-value of NA to genes containing count outliers, as identified using Cook’s distance. See the cooksCutoff argument for control of this behavior.

Then if you read more:

cooksCutoff - threshold on Cook’s distance, such that if one or more samples for a row have a distance higher, the p-value for the row is set to NA. The default cutoff is the .99 quantile of the F(p, m-p) distribution, where p is the number of coefficients being fitted and m is the number of samples. Set to Inf or FALSE to disable the resetting of p-values to NA. Note: this test excludes the Cook’s distance of samples belonging to experimental groups with only 2 samples.
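The default cutoff described there can be computed directly with base R's qf(). The sketch below assumes a design with one coefficient per strain level (p = 5, that is, an intercept plus four strain effects); the helper name cooks_default_cutoff is made up for illustration and is not a DESeq2 function.

```r
# .99 quantile of the F(p, m - p) distribution, as described in ?results
cooks_default_cutoff <- function(m, p) {
  qf(0.99, df1 = p, df2 = m - p)
}

# 5 strains in triplicate: m = 15 samples, p = 5 coefficients (assumed design)
cutoff_triplicate <- cooks_default_cutoff(m = 15, p = 5)

# 5 strains in duplicate: m = 10 samples, same p
cutoff_duplicate <- cooks_default_cutoff(m = 10, p = 5)
```

For the triplicate design the cutoff comes out at roughly 5.6. In the rerun with two replicates per strain, every group has only 2 samples, so according to the note quoted above their Cook's distances are excluded from the test, which appears consistent with the gene no longer being set to NA.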
modified 3.1 years ago • written 3.1 years ago by Michael Love