Question: DESeq2: How to manually perform filtering and outlier detection?
14 months ago, heikki.sarin wrote:


I'm quite sure I have a problem with DESeq2's independent filtering. DESeq2 doesn't flag any outliers when performing the DE analysis. We have a slight concern about blood being in the samples, as the raw counts of some HB-related genes vary from hundreds to tens of thousands. For some reason DESeq2 doesn't get rid of these high-count outlier genes, which I think would be appropriate. Or is it possible for some genes to have such big variation in gene expression?

We filtered out lowly expressed genes before the analysis, but I think more filtering should perhaps be done manually to remove these high-count outlier genes. In the Q-Q plot of the p-values the trend is a bit inflated, which is also a slight concern.

Help would be really appreciated, because I'm not quite sure how to approach this problem.

14 months ago, Wolfgang Huber (EMBL European Molecular Biology Laboratory) wrote:


First, it will be good to clarify two concepts:

  • outlier removal: removing genes whose data do not appear to fit the mathematical model used by DESeq2 (it's called a "gamma-Poisson generalized linear model") and your design matrix
  • independent filtering: removing genes whose data fit the model perfectly well but that, for various reasons, appear so unlikely to be detected as differentially expressed that overall (across all genes) detection power is improved by setting them aside

It seems you mean the former.

You can use any method that you like to flag and set aside such genes. I advise visualizing the data from such genes, and some others. With that perhaps you can come up with a programmable rule to automate this task. DESeq2 offers "Cook's distance" as one method to that end; are you sure it was enabled for your analysis?
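As an illustration of what such a programmable rule could look like, here is a minimal sketch in Python (not part of DESeq2) that flags genes whose largest per-sample Cook's distance exceeds a cutoff. The function name `flag_genes_by_cooks`, the gene IDs, the distance values and the cutoff of 1.0 are all hypothetical. In practice the distances could be exported from the DESeq2 object, where they are stored in `assays(dds)[["cooks"]]`, and the cutoff is a choice you would need to justify by looking at the data. For reference, the DESeq2 documentation describes its default cutoff as the 0.99 quantile of the F(p, m - p) distribution, with p the number of model coefficients and m the number of samples.

```python
from typing import Dict, List, Optional


def flag_genes_by_cooks(
    cooks: Dict[str, List[Optional[float]]], cutoff: float
) -> List[str]:
    """Return the IDs of genes whose largest Cook's distance exceeds `cutoff`.

    `cooks` maps a gene ID to its per-sample Cook's distances.
    Missing values (None) are ignored.
    """
    flagged = []
    for gene, distances in cooks.items():
        values = [d for d in distances if d is not None]
        # A gene is flagged if any single sample is too influential.
        if values and max(values) > cutoff:
            flagged.append(gene)
    return flagged


# Hypothetical per-sample Cook's distances for three genes, four samples each.
cooks = {
    "geneA": [0.02, 0.05, 0.01, 0.03],
    "geneB": [0.10, 1.90, 0.04, 0.02],  # one sample dominates the fit
    "geneC": [0.00, None, 0.02, 0.01],
}

cutoff = 1.0  # hypothetical value, not a recommendation
flagged = flag_genes_by_cooks(cooks, cutoff)
print(flagged)
```

Flagging on the per-gene maximum is only one possible rule. Whatever rule you choose, it is worth plotting the counts of the flagged genes, and of some unflagged ones, to check that it picks out what you expect.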

For more concrete advice, please post a rendered (HTML) Rmarkdown document with your analysis (and plots), and the output of session_info.

Hope this helps



Btw, there is now an (even better) alternative to independent filtering: independent hypothesis weighting, IHW.

written 14 months ago by Wolfgang Huber

Yeah the former is my focus here.

1) I think one of the problems is that I used the LRT. Do I understand correctly that with the LRT, Cook's distance is not applied in the analysis? I have studied the Cook's distance values, but I'm not quite sure what threshold to use for manual filtering; they range from 0 to 2. Are there any rules of thumb as to what is regarded as an outlier based on Cook's distance?

2) I also studied baseMean as a possible parameter for setting a threshold, as we have some genes with expression of millions of reads, but from plotting it's hard to tell whether they are outliers, especially from normalised counts. If there are outliers with abnormally high variance in reads between libraries, what would be the best approach to detect these and filter them out of the analysis?

With best regards,


written 14 months ago by heikki.sarin