Question

DESeq2: How to manually perform filtering and outlier detection?

heikki.sarin ▴ 10

Hi,

I'm quite sure I have a problem with the DESeq2 independent filtering. DESeq2 doesn't flag any outliers when performing the DE analysis. We are a little concerned about blood being in the samples, as the raw counts vary from hundreds to tens of thousands in some HB-related genes. For some reason DESeq2 doesn't get rid of these high-count outlier genes, which I think would be appropriate. Or is it possible for some genes to have such big variation in gene expression?

We have filtered out lowly expressed genes before the analysis, but I think more filtering should perhaps be done manually to get these high-count outlier genes removed. When plotting the Q-Q plot of p-values, the trend is a bit inflated, which is also a slight concern.

Help would be really appreciated because I'm not quite sure how to approach this problem.

deseq2 filtering outliers count outliers • 5.1k views

updated 6.7 years ago by Wolfgang Huber ★ 13k • written 6.7 years ago by heikki.sarin ▴ 10

score 1 · Answer 1 · 2017-08-10

Wolfgang Huber ★ 13k

EMBL European Molecular Biology Laborat…

Heikki

First, it would be good to clarify two concepts:

- outlier removal: removing genes whose data do not appear to fit the mathematical model used by DESeq2 (it's called a "gamma-Poisson generalized linear model") and your design matrix
- independent filtering: removing genes whose data fit the model perfectly well but that, for various reasons, appear so unlikely to be detected as differentially expressed that overall (across all genes) detection power is improved by setting them aside

It seems you mean the former.
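To make the second concept concrete, here is a minimal sketch of independent filtering in plain Python. It is not DESeq2's implementation: the function names, the fixed `min_mean` threshold, and the example numbers are all made up for illustration (DESeq2 picks its threshold automatically). The idea shown is only that dropping low-count genes before the Benjamini-Hochberg adjustment reduces the multiple-testing burden for the genes that remain.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def independent_filter(mean_counts, pvalues, min_mean, alpha=0.1):
    """Indices of genes significant at `alpha` after dropping genes whose
    mean count is below `min_mean` (a hypothetical, fixed threshold)."""
    kept = [i for i, mu in enumerate(mean_counts) if mu >= min_mean]
    if not kept:
        return []
    padj = bh_adjust([pvalues[i] for i in kept])
    return [i for i, q in zip(kept, padj) if q <= alpha]


# Made-up example: two low-count genes with uninformative p-values.
mean_counts = [0.5, 1.0, 200, 350, 500, 800]
pvalues = [0.6, 0.9, 0.001, 0.02, 0.04, 0.5]

unfiltered = independent_filter(mean_counts, pvalues, min_mean=0, alpha=0.05)
filtered = independent_filter(mean_counts, pvalues, min_mean=10, alpha=0.05)
```

In this toy data, filtering out the two low-count genes lets one more gene pass the adjusted threshold. Note that the filter statistic (mean count) is chosen without looking at the p-values, which is what makes the filtering "independent".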

You can use any method that you like to flag and set aside such genes. I advise visualizing the data from such genes, and some others. With that perhaps you can come up with a programmable rule to automate this task. DESeq2 offers "Cook's distance" as one method to that end; are you sure it was enabled for your analysis?
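As one illustration of what such a programmable rule might look like, here is a minimal sketch in plain Python. It is not Cook's distance and not anything DESeq2 does internally: it simply flags samples whose log-scale count for a gene lies far from that gene's median, measured in robust (MAD-based) units. The function name, the cutoff of 5, and the example counts are assumptions made for illustration. The rule also ignores the design matrix, so with real data you would probably want to apply it within each experimental group.

```python
import math
import statistics


def flag_outlier_samples(counts, cutoff=5.0):
    """Return indices of samples whose log2(count + 1) deviates from the
    gene's median by more than `cutoff` robust standard deviations."""
    logs = [math.log2(c + 1) for c in counts]
    med = statistics.median(logs)
    mad = statistics.median([abs(x - med) for x in logs])
    if mad == 0:
        # No spread to scale by; the rule cannot decide, so flag nothing.
        return []
    scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation
    return [i for i, x in enumerate(logs) if abs(x - med) / scale > cutoff]


# Made-up normalized counts for two genes across six samples.
hb_like_gene = [300, 420, 380, 510, 450, 45000]
stable_gene = [100, 110, 95, 105, 98, 102]

hb_flags = flag_outlier_samples(hb_like_gene)
stable_flags = flag_outlier_samples(stable_gene)
```

Genes with a non-empty flag list are candidates for visual inspection first; whether to set them aside is a judgment call that depends on the biology.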

For more concrete advice, please post a rendered (HTML) Rmarkdown document with your analysis (and plots), and the output of session_info.

Hope this helps

Wolfgang

6.7 years ago by Wolfgang Huber ★ 13k

Btw, there is now an (even better) alternative to independent filtering: independent hypothesis weighting, IHW.

6.7 years ago by Wolfgang Huber ★ 13k

Yeah the former is my focus here.

1) I think one of the problems is that I used the LRT. Do I understand correctly that with the LRT, Cook's distance is not applied in the analysis? I have studied the Cook's distance values, but I'm not quite sure what threshold to use for manual filtering; they range from 0 to 2. Are there any rules of thumb as to what is regarded as an outlier based on Cook's distance?

2) I also studied the baseMean as a possible parameter for setting a threshold, as we have some genes with expression of millions of reads, but from plotting it's hard to tell whether they are outliers, especially from normalised counts. If there are outliers with abnormally high variance in reads between libraries, what would be the best approach to detect these and filter them out of the analysis?

With best regards,

Heikki

6.7 years ago by heikki.sarin ▴ 10