Remove genes with very low counts for all samples or let DESeq2 perform independent filtering?
colaneri ▴ 30
@colaneri-7770
United States

I’m following the DESeq2 tutorial to perform DGE analysis. I noticed that before running

dds <- DESeq(dds)

it is recommended to remove genes whose counts are 0 in all samples. I have two questions about this step:

1. Why only genes with 0 counts in all samples? What about genes whose counts add up to 5 across all samples? Or 10? What would be a reasonable threshold? I assume that people experienced in DGE analysis have a number that is regarded as a correct and safe threshold. I would appreciate some advice on this.
2. I understand that DESeq2 performs independent filtering: it identifies a count-based threshold and removes genes whose counts are too low to produce a trustworthy result. My question is: why bother with the step above if these genes are going to be filtered out anyway?
deseq2 independent filtering lowcountgenes • 6.0k views
@mikelove
United States

from the vignette:

vignette("DESeq2")

"1.3.5 Pre-filtering
While it is not necessary to pre-filter low count genes before running the DESeq2 functions, there are two reasons which make pre-filtering useful: by removing rows in which there are no reads or nearly no reads, we reduce the memory size of the dds data object and we increase the speed of the transformation and testing functions within DESeq2. "

You don't have to filter at all though. The safest threshold would be to not filter anything above a row sum of 0, and just let the data-driven software (which lives in the genefilter package, outside of DESeq2) choose the threshold that maximizes power. For more details you can read the citation for the genefilter package, which is also referenced in the DESeq2 paper's section on independent filtering.
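As a rough illustration of the pre-filtering step described in the vignette quote, here is a minimal base-R sketch on a toy count matrix. The matrix and its gene and sample names are made up for illustration; with a real DESeqDataSet you would compute the row sums on `counts(dds)` and subset `dds` instead.

```r
# Toy count matrix (hypothetical data): 6 genes x 4 samples
counts_mat <- matrix(
  c(  0,  0,   0,   0,
      1,  0,   2,   0,
     15, 22,   9,  30,
      0,  0,   0,   1,
    120, 98, 143, 110,
      3,  4,   2,   5),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("gene", 1:6), paste0("sample", 1:4))
)

# Keep every row with at least one read in total; this only drops
# all-zero rows and leaves the real filtering decision to the
# downstream independent filtering step.
keep <- rowSums(counts_mat) > 0
filtered <- counts_mat[keep, , drop = FALSE]

dim(counts_mat)  # dimensions before filtering
dim(filtered)    # dimensions after filtering
```

Raising the cutoff (for example to a row sum of 10) mainly saves memory and run time, which is the motivation the vignette gives for pre-filtering.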

0
Entering edit mode

Interesting.

On our dataset of 15 samples, we were removing the rows in which the average count was 1 or less.  We even thought about removing the rows in which the average count was 2 or less.

Are you saying it is better to not remove these rows at all? Even the rows where the counts are zero for all samples?

1
Entering edit mode

I didn't say it was better. It should make no difference. You could raise the threshold even higher and it would still make no difference, up to the point at which you are filtering too much (which will be different for each experiment).

The question was: what is a safe / reasonable threshold that will work for all experiments? Our recommendation (and the default in DESeq2) is to let the genefilter software optimize the threshold, such that sensitivity (statistical power) is maximized.
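To make the idea of a data-driven threshold concrete, here is a simplified base-R sketch. It is not the genefilter or DESeq2 implementation. The simulated data, the Welch t-test stand-in, and all variable names are assumptions for illustration only. It tries several cutoffs on the mean count, applies Benjamini-Hochberg adjustment to the genes that pass each cutoff, and picks the cutoff that yields the most rejections.

```r
set.seed(1)
n_genes <- 2000
group <- factor(rep(c("A", "B"), each = 5))

# Toy data (hypothetical): baseline means from about 0.25 to about 1000 counts
base_mean <- 2^runif(n_genes, min = -2, max = 10)
mu <- matrix(base_mean, nrow = n_genes, ncol = length(group))
# Assumption: the first 200 genes are 4-fold higher in group B
mu[1:200, group == "B"] <- mu[1:200, group == "B"] * 4
counts_mat <- matrix(rnbinom(length(mu), mu = mu, size = 5), nrow = n_genes)

# Stand-in test (not DESeq2's test): Welch t-test on log2 counts per gene
log_counts <- log2(counts_mat + 1)
pval <- apply(log_counts, 1, function(x) {
  tryCatch(t.test(x[group == "A"], x[group == "B"])$p.value,
           error = function(e) NA_real_)
})
pval[!is.finite(pval)] <- 1  # constant rows carry no evidence

# Filter statistic: mean count over all samples, ignoring group labels
filter_stat <- rowMeans(counts_mat)

alpha <- 0.1
theta <- seq(0, 0.8, by = 0.05)                  # fraction of genes removed
cutoffs <- quantile(filter_stat, probs = theta)  # candidate thresholds

# For each cutoff, adjust only the surviving p-values and count rejections
n_rej <- sapply(cutoffs, function(cut) {
  keep <- filter_stat >= cut
  sum(p.adjust(pval[keep], method = "BH") < alpha)
})

best <- which.max(n_rej)
best_cutoff <- unname(cutoffs[best])
data.frame(theta = theta, cutoff = round(cutoffs, 2), n_rej = n_rej)
```

In DESeq2 itself this optimization happens inside `results()`, so no threshold has to be chosen by hand. The sketch only shows why a cutoff chosen this way differs from experiment to experiment.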

There is a separate reference for genefilter if you want to read about this. Also there is a new approach from Wolfgang's group: https://www.bioconductor.org/packages/IHW