Question

Remove genes with very low counts for all samples or let DESeq2 perform independent filtering?


colaneri ▴ 30

@colaneri-7770


I’m following the DESeq2 tutorial to perform a DGE analysis. I noticed that before running

dds <- DESeq(dds)

it is recommended to remove genes whose counts are 0 in all samples. I have two questions about this step:

Why only genes with 0 counts in all samples? What about genes whose counts sum to 5 across all samples? Or 10? What would be a reasonable threshold? I assume that people experienced in DGE analysis have a number they consider a correct and safe threshold, and I would like some advice on this.
I understand that DESeq2 performs independent filtering, and that for this purpose it identifies a threshold based on counts and removes genes whose counts are too low to produce a trustworthy result. My question is: why bother performing the step above if these genes are going to be filtered out anyway?

deseq2 independent filtering lowcountgenes • 9.5k views

ADD COMMENT • link updated 8.1 years ago by Michael Love 41k • written 8.1 years ago by colaneri ▴ 30

score 1 · Answer 1 · 2016-03-19


Michael Love 41k

@mikelove


From the vignette:

vignette("DESeq2")

"1.3.5 Pre-filtering
While it is not necessary to pre-filter low count genes before running the DESeq2 functions, there are two reasons which make pre-filtering useful: by removing rows in which there are no reads or nearly no reads, we reduce the memory size of the dds data object and we increase the speed of the transformation and testing functions within DESeq2. "
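The pre-filtering described in the quote amounts to a row-sum filter. Here is a minimal sketch in base R, using a made-up count matrix so that it runs without DESeq2; the gene and sample names and the cutoff `min_total` are assumptions for illustration only, not recommended values.

```r
# Toy count matrix: 6 genes (rows) x 4 samples (columns); values are made up.
counts <- matrix(
  c(  0,  0,   0,   0,
      1,  0,   2,   0,
      3,  4,   2,   5,
    120, 98, 143, 110,
      0,  1,   0,   0,
     15, 22,   9,  30),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("gene", 1:6), paste0("sample", 1:4))
)

# Minimal pre-filter from the question: drop genes with zero counts everywhere.
keep_nonzero <- rowSums(counts) > 0

# A stricter but still arbitrary cutoff on the total count per gene.
min_total <- 10  # assumed value for illustration, not a recommendation
keep_min_total <- rowSums(counts) >= min_total

# Subset the matrix; drop = FALSE keeps it a matrix even if one row remains.
filtered <- counts[keep_nonzero, , drop = FALSE]
```

With a real DESeqDataSet, the same idea would be written as `dds <- dds[rowSums(counts(dds)) > 0, ]`.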

You don't have to filter at all, though. The safest approach is to remove only rows with a row sum of 0, and let the data-driven software (which lives in the genefilter package, outside of DESeq2) choose the threshold that maximizes power. For more details, you can read the citation for the genefilter package, which is also referenced in the section on independent filtering in the DESeq2 paper.

ADD COMMENT • link 8.1 years ago Michael Love 41k


Interesting.

On our dataset of 15 samples, we were removing the rows in which the average count was 1 or less. We even thought about removing the rows in which the average count was 2 or less.

Are you saying it is better not to remove these rows at all? Even the rows where the counts are zero for all samples?

ADD REPLY • link 8.1 years ago Marcelo Pereira ▴ 70


I didn't say it was better. It should make no difference. You could raise the threshold even higher and it would begin to increase sensitivity, up to a point at which you would be filtering too much (which will be different for each experiment).

The question was: what is a safe / reasonable threshold that will work for all experiments? Our recommendation (and the default in DESeq2) is to let the genefilter software optimize the threshold, so that sensitivity (statistical power) is maximized.
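To make the idea of a data-driven threshold concrete, here is a simplified base R sketch: try a grid of cutoffs on a filter statistic (here a simulated mean count), apply the Benjamini-Hochberg adjustment to the genes that pass, and keep the cutoff that yields the most rejections. The function `choose_threshold` and the simulated data are hypothetical and only illustrate the principle; the actual implementation used by DESeq2 and genefilter differs in its details.

```r
set.seed(1)

# Simulated data (assumption): null genes tend to have low mean counts and
# uniform p-values; non-null genes have higher counts and small p-values.
n_null <- 4000
n_alt  <- 1000
mean_count <- c(rlnorm(n_null, meanlog = 1, sdlog = 1.5),
                rlnorm(n_alt,  meanlog = 4, sdlog = 1))
pvalue <- c(runif(n_null), rbeta(n_alt, 0.1, 1))

# Hypothetical helper: for each quantile of the filter statistic, drop genes
# below that quantile, BH-adjust the remaining p-values, count rejections.
choose_threshold <- function(pvalue, filter_stat, alpha = 0.1,
                             theta = seq(0, 0.95, by = 0.05)) {
  cutoffs <- quantile(filter_stat, probs = theta)
  rejections <- vapply(cutoffs, function(cutoff) {
    keep <- filter_stat >= cutoff
    sum(p.adjust(pvalue[keep], method = "BH") < alpha)
  }, integer(1))
  best <- which.max(rejections)  # first cutoff reaching the maximum
  list(theta = theta[best],
       cutoff = unname(cutoffs[best]),
       rejections = rejections)
}

res <- choose_threshold(pvalue, mean_count)
res$theta       # fraction of genes filtered out at the chosen cutoff
res$rejections  # number of rejections at each candidate cutoff
```

The first entry of `res$rejections` corresponds to no filtering at all, so comparing it with the maximum shows how much the filter gains on this toy data. In DESeq2 itself, this behaviour is controlled by the `independentFiltering` argument of `results()`.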

There is a separate reference for genefilter if you want to read about this. Also there is a new approach from Wolfgang's group: https://www.bioconductor.org/packages/IHW

ADD REPLY • link 8.1 years ago Michael Love 41k