Question: Does manual pre-filtering of the data on read counts violate the assumptions of the dispersion estimation in DESeq2?
Johannes Rainer wrote (3.5 years ago):

Dear all!

Sorry for yet another question on pre-filtering and DESeq2, but I didn't find anything related in the support pages... Now to my question:

I know that DESeq2 does a wonderful job of automatic pre-filtering, but in my case it removed a miRNA which, with 260 counts on average, is not that lowly expressed, and whose differential expression I really believe in. So, basically, I would like to do the pre-filtering myself, accepting that I lose quite some power due to the stronger multiple-testing adjustment.

My question, however, is whether this pre-filtering, i.e. removing low-count features, interferes with or violates the assumptions of the dispersion estimation (or any other assumption) in the DESeq2 model (also considering that the pre-filtering in DESeq2 takes place after calculation of the raw p-values). My concern comes from the (ancient) field of microarrays, where variance-based pre-filtering was thought to violate assumptions of the moderated t-test in limma.

Is the situation similar for DESeq2 and manual pre-filtering?


Thanks in advance!

cheers, jo

Michael Love wrote (3.5 years ago):

hi Johannes,

Removing low count features in the beginning, before DESeq(), is ok for doing your own filtering. Filtering on any kind of location statistic of the normalized counts across all samples is valid for the method. I wouldn't recommend variance filtering, because we look at the total distribution of dispersion estimates during the estimation steps. 
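
A minimal sketch of a filter on a location statistic (here the mean of the normalized counts), written in plain Python rather than R to show the logic only. The function name, the feature names, the size factors and the cutoff of 10 are all hypothetical, and the size factors are assumed to have been computed beforehand.

```python
from statistics import mean


def filter_by_mean_normalized_count(counts, size_factors, min_mean=10.0):
    """Return the names of features whose mean normalized count is >= min_mean.

    counts: dict mapping feature name -> list of raw counts, one per sample
    size_factors: per-sample scaling factors (assumed precomputed)
    min_mean: hypothetical cutoff on the mean normalized count
    """
    kept = []
    for feature, raw in counts.items():
        # Normalize each raw count by its sample's size factor.
        normalized = [c / s for c, s in zip(raw, size_factors)]
        # The filter uses only a location statistic (the mean), not the variance.
        if mean(normalized) >= min_mean:
            kept.append(feature)
    return kept


# Made-up example data: three features, four samples.
counts = {
    "miR-a": [250, 270, 240, 280],
    "miR-b": [0, 2, 1, 0],
    "miR-c": [15, 9, 12, 20],
}
size_factors = [1.0, 1.1, 0.9, 1.0]

kept = filter_by_mean_normalized_count(counts, size_factors, min_mean=10.0)
print(kept)
```

The filter looks at every feature across all samples and ignores the sample groups, which is the kind of filtering described above as valid.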

The automatic independent filtering is generally useful, except in cases like yours, where a specific gene of interest falls under the threshold. The genefilter mechanism will always raise the threshold above a potentially significant gene with baseMean x if doing so adds more than one gene with baseMean y to the set of rejections, where x < y.
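
The threshold behaviour described above can be illustrated with a simplified sketch: try each observed baseMean as a cutoff, apply Benjamini-Hochberg to the genes that remain, and keep the cutoff with the most rejections. This is a toy version under stated assumptions, not DESeq2's or genefilter's implementation; the real procedure differs in detail (for instance in how candidate thresholds are chosen). The function names and data are made up.

```python
def bh_rejections(pvalues, alpha=0.1):
    """Count Benjamini-Hochberg rejections at level alpha."""
    m = len(pvalues)
    k = 0
    for i, p in enumerate(sorted(pvalues), start=1):
        # BH rejects the k smallest p-values, where k is the largest
        # rank i with p_(i) <= alpha * i / m.
        if p <= alpha * i / m:
            k = i
    return k


def best_threshold(genes, alpha=0.1):
    """Return (cutoff, rejections) maximizing the number of BH rejections.

    genes: list of (base_mean, pvalue) tuples; genes with
    base_mean >= cutoff are kept and tested.
    """
    best = (0.0, bh_rejections([p for _, p in genes], alpha))
    for cutoff in sorted({bm for bm, _ in genes}):
        kept = [p for bm, p in genes if bm >= cutoff]
        n = bh_rejections(kept, alpha)
        if n > best[1]:
            best = (cutoff, n)
    return best


# Made-up data: one low-count gene with a very small p-value, three
# low-count genes with large p-values, five high-count borderline genes.
genes = [
    (5, 0.001),
    (6, 0.5), (7, 0.4), (8, 0.3),
    (50, 0.015), (60, 0.035), (70, 0.055), (80, 0.075), (90, 0.095),
]

no_filter = bh_rejections([p for _, p in genes])
best = best_threshold(genes)
print(no_filter, best)
```

In this toy data the cutoff lands above the gene with baseMean 5 even though it has the smallest p-value, because dropping it together with its low-count neighbours yields more rejections overall. In DESeq2 itself, the automatic step can be switched off with the `independentFiltering` argument of `results()`.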
