DESeq2: How to manually perform filtering and outlier detection?
heikki.sarin ▴ 10
@heikkisarin-13379

Hi,

I'm quite sure I have a problem with the DESeq2 independent filtering. DESeq2 doesn't flag any outliers when performing the DE analysis. We are a little concerned that there is blood in the samples, as the raw counts vary from hundreds to tens of thousands in some HB-related genes. For some reason DESeq2 doesn't get rid of these high-count outlier genes, which I think would be appropriate. Or is it possible for some genes to have such big variation in gene expression?

We have filtered out lowly expressed genes before the analysis, but I think more filtering should perhaps be done manually to get these high-count outlier genes removed. In the Q-Q plot of the p-values the trend is a bit inflated, which is also a slight concern.

Help would be really appreciated because I'm not quite sure how to approach this problem.

deseq2 filtering outliers count outliers
@wolfgang-huber-3550
EMBL European Molecular Biology Laborat…

Heikki

First, it'll be good to clarify concepts regarding

  • outlier removal: removing genes whose data do not appear to fit the mathematical model used by DESeq2 (it's called a "gamma-Poisson generalized linear model") and your design matrix
  • independent filtering: removing genes whose data perfectly well fit the model but that for various reasons appear so unlikely to be detected as differentially expressed that overall (across all genes) detection power is improved by setting them aside

It seems you mean the former.

You can use any method that you like to flag and set aside such genes. I advise visualizing the data from such genes, and some others. With that perhaps you can come up with a programmable rule to automate this task. DESeq2 offers "Cook's distance" as one method to that end; are you sure it was enabled for your analysis?
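
To give a feel for the scale of the Cook's distance cutoff: by default, DESeq2's `results()` uses the 0.99 quantile of the F(p, m - p) distribution, where p is the number of model coefficients and m the number of samples. Below is a minimal Python sketch, assuming a two-coefficient design (intercept plus one two-level factor), for which the F quantile has a closed form. The function names and the sample counts are made up for illustration.

```python
def f_quantile_two_numerator_df(q, denom_df):
    """Quantile of the F(2, denom_df) distribution.

    With 2 numerator degrees of freedom the CDF is
    1 - (1 + 2*f/denom_df) ** (-denom_df / 2), which inverts in closed form.
    """
    if not 0.0 < q < 1.0:
        raise ValueError("q must be strictly between 0 and 1")
    if denom_df <= 0:
        raise ValueError("denom_df must be positive")
    return (denom_df / 2.0) * ((1.0 - q) ** (-2.0 / denom_df) - 1.0)


def cooks_cutoff_two_coefficients(n_samples, q=0.99):
    """F(p, m - p) quantile for p = 2 coefficients and m = n_samples.

    Assumption: the design has exactly two coefficients; other designs
    need a general F quantile, which is not implemented here.
    """
    n_coefficients = 2
    return f_quantile_two_numerator_df(q, n_samples - n_coefficients)


if __name__ == "__main__":
    for m in (6, 12, 24):  # hypothetical numbers of samples
        print(m, round(cooks_cutoff_two_coefficients(m), 2))
```

In R the same number would come from `qf(0.99, p, m - p)` for any p. This only shows how large the default cutoff is; which genes are actually flagged depends on the fitted model and the data.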

For more concrete advice, please post a rendered (HTML) Rmarkdown document with your analysis (and plots), and the output of session_info.

Hope this helps

Wolfgang


Btw, there is now an (even better) alternative to independent filtering: independent hypothesis weighting, IHW.
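
For readers unfamiliar with hypothesis weighting, the basic building block is the weighted Benjamini-Hochberg procedure: each p-value is divided by a non-negative weight (the weights averaging 1) before the usual adjustment. The sketch below is plain Python and is not the IHW package; IHW additionally learns the weights from a covariate, which is not shown here. The function names and the example weights are assumptions for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def weighted_bh(pvalues, weights):
    """Weighted BH: divide each p-value by its weight, then apply BH.

    Weights are rescaled to average 1; a weight of 0 means the
    hypothesis can never be rejected (its weighted p-value is 1).
    """
    if len(pvalues) != len(weights):
        raise ValueError("pvalues and weights must have the same length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    mean_w = sum(weights) / len(weights)
    if mean_w == 0:
        raise ValueError("at least one weight must be positive")
    weighted_p = [
        min(1.0, p * mean_w / w) if w > 0 else 1.0
        for p, w in zip(pvalues, weights)
    ]
    return benjamini_hochberg(weighted_p)


if __name__ == "__main__":
    p = [0.01, 0.02, 0.03, 0.5]
    print(weighted_bh(p, [1, 1, 1, 1]))  # same as plain BH
    print(weighted_bh(p, [2, 2, 0, 0]))  # made-up weights
```

Setting a weight to 0 behaves like filtering that gene out, which is why weighting can be seen as a softer version of independent filtering.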


Yeah the former is my focus here.

1) I think one of the problems is that I used the LRT. Do I understand correctly that with the LRT the Cook's distance is not applied in the analysis? I have looked at the Cook's distance values, but I'm not quite sure what threshold to use for manual filtering; they range from 0 to 2. Are there any rules of thumb for what is regarded as an outlier based on Cook's distance?

2) I also looked at the base mean as a possible parameter for setting a threshold, as we have some genes with expression in the millions of reads, but from plotting it's hard to tell whether they are outliers, especially from normalised counts. If there are outliers with abnormally high variance in reads between libraries, what would be the best approach to detect them and filter them out of the analysis?

With best regards,

Heikki
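
One possible programmable rule for the second point, sketched in Python rather than R: for each gene, compare each library's log2 normalised count with the gene's median and flag values that lie more than `k` scaled MADs away. The cutoff `k = 5`, the pseudocount and the function name are assumptions for illustration. This is not what DESeq2 does internally, and flagged genes should still be inspected by eye.

```python
import math
from statistics import median


def flag_outlier_samples(counts, k=5.0, pseudocount=1.0):
    """Return indices of samples whose log2 count is far from the gene median.

    counts: normalised counts for ONE gene across libraries.
    k: hypothetical cutoff, in units of the scaled MAD.
    """
    logs = [math.log2(c + pseudocount) for c in counts]
    centre = median(logs)
    # 1.4826 scales the MAD to match the standard deviation under normality.
    mad = 1.4826 * median(abs(x - centre) for x in logs)
    if mad == 0.0:
        # More than half of the values are identical, so the MAD gives no
        # scale to measure against; this sketch then flags nothing.
        return []
    return [i for i, x in enumerate(logs) if abs(x - centre) / mad > k]


if __name__ == "__main__":
    # Made-up counts: five similar libraries and one extreme one.
    print(flag_outlier_samples([100, 120, 90, 110, 95, 20000]))
```

The rule ignores the experimental design, so a genuine difference between conditions can look like an outlier; applying it within each condition separately would be the more cautious choice.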
