Question

Does manual pre-filtering of the data on read counts violate the assumptions of the dispersion estimation in DESeq2?

Johannes Rainer ★ 2.1k

@johannes-rainer-6987

Italy

Dear all!

Sorry for yet another question on pre-filtering and DESeq2, but I didn't find anything related in the support pages... Now to my question:

I know that DESeq2 does a wonderful job of automatic pre-filtering, but in my case it removed a miRNA which, with 260 counts on average, is not that lowly expressed, and whose differential expression I really believe in. So, basically, I would like to do the pre-filtering myself, accepting that I lose quite some power due to the stronger adjustment for multiple hypothesis testing.

My question now, however, is whether this pre-filtering, i.e. removing low-count features, interferes with or violates the assumptions of the dispersion estimation (or any other assumption) in the DESeq2 model (also considering that the pre-filtering in DESeq2 takes place after calculation of the raw p-values). My concern comes from the (ancient) field of microarrays, where variance-based pre-filtering was thought to violate assumptions of the moderated t-test in limma.

Is the situation similar for DESeq2 and manual pre-filtering?

Thanks in advance!

cheers, jo

deseq2 pre-filtering • 2.3k views

updated 10.8 years ago by Michael Love 43k • written 10.8 years ago by Johannes Rainer ★ 2.1k

score 1 · Accepted Answer · 2015-04-08

hi Johannes,

Removing low-count features at the beginning, before DESeq(), is fine if you want to do your own filtering. Filtering on any kind of location statistic of the normalized counts across all samples (for example the mean) is valid for the method. I wouldn't recommend variance filtering, because the estimation steps look at the overall distribution of dispersion estimates.
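To make the idea of filtering on a location statistic concrete, here is a minimal sketch in plain Python rather than R, so it is not DESeq2 code. It assumes a small made-up count matrix, computes median-of-ratios size factors in the DESeq style, and keeps features whose mean normalized count reaches a cutoff. The function names, the feature names, the counts and the cutoff of 10 are all hypothetical and chosen only for illustration.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors, one per sample (column)."""
    n_samples = len(counts[0])
    # Only features without zero counts have a finite log geometric mean.
    usable = [row for row in counts if all(c > 0 for c in row)]
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lgm
                      for row, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


def prefilter_by_mean(counts, names, min_mean):
    """Keep features whose mean normalized count is at least min_mean."""
    sf = size_factors(counts)
    kept_names = []
    for name, row in zip(names, counts):
        normalized = [c / s for c, s in zip(row, sf)]
        if sum(normalized) / len(normalized) >= min_mean:
            kept_names.append(name)
    return kept_names


# Hypothetical data: rows are features, columns are samples.
names = ["miR_a", "gene_b", "gene_c", "gene_d", "gene_e"]
counts = [
    [250, 270, 240, 280],
    [1000, 1100, 900, 1050],
    [0, 1, 0, 2],
    [3, 5, 2, 4],
    [60, 55, 70, 65],
]

kept = prefilter_by_mean(counts, names, min_mean=10)
print(kept)
```

The filter statistic here uses all samples at once and ignores the group labels, which is the property that matters. A variance-based filter would not be a drop-in replacement, for the reason given above.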

The automatic independent filtering is generally useful, except in cases like yours, where specific genes of interest fall under the threshold. The genefilter mechanism will always raise the threshold above a potentially significant gene with baseMean x if doing so adds more than one gene with baseMean y to the set of significant genes, where x < y.
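The threshold behaviour can be illustrated with a simplified sketch. This is not the genefilter or DESeq2 implementation (which scans quantiles of the filter statistic, and whose details differ between versions). It just tries every observed baseMean as a cutoff, applies Benjamini-Hochberg to the remaining p-values, and keeps the cutoff that gives the most rejections. All gene names, baseMean values and p-values are made up.

```python
def bh_rejections(pvalues, alpha):
    """Number of Benjamini-Hochberg rejections at FDR level alpha."""
    m = len(pvalues)
    k = 0
    for i, p in enumerate(sorted(pvalues), start=1):
        if p <= alpha * i / m:
            k = i  # step-up rule: remember the largest rank that passes
    return k


def best_threshold(genes, alpha):
    """genes: list of (name, base_mean, pvalue) tuples.

    Returns the (threshold, rejections) pair with the most rejections;
    ties go to the lowest threshold, starting from no filtering at all.
    """
    best = (0.0, bh_rejections([p for _, _, p in genes], alpha))
    for t in sorted({bm for _, bm, _ in genes}):
        kept_p = [p for _, bm, p in genes if bm >= t]
        n = bh_rejections(kept_p, alpha)
        if n > best[1]:
            best = (t, n)
    return best


# Hypothetical results table: (name, baseMean, raw p-value).
genes = [
    ("g1", 2, 0.9), ("g2", 3, 0.7), ("g3", 4, 0.8), ("g4", 5, 0.6),
    ("X", 8, 0.015),  # low baseMean, small p-value
    ("g6", 12, 0.5),
    ("h1", 100, 0.001), ("h2", 200, 0.06),
    ("h3", 300, 0.07), ("h4", 500, 0.09),
]

unfiltered = bh_rejections([p for _, _, p in genes], alpha=0.1)
threshold, rejections = best_threshold(genes, alpha=0.1)
print(unfiltered, threshold, rejections)
```

In this toy table, gene X is among the rejections when nothing is filtered, but the cutoff that maximizes rejections lies above its baseMean, because dropping X and the low-count genes around it lets several higher-count genes pass the adjusted threshold. That is the situation in which doing your own pre-filtering and switching off the automatic filter makes sense.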