Choosing a threshold for minimum counts in RNAseq
@laianavarromartin-7750
Hi,

I was wondering what criterion is used to set the minimum count a gene needs to be considered for further analysis when doing DE between control and treated samples. Is there any way to change this threshold to be more stringent? When I analyze my data set with the default settings, I get 1400 DE genes. However, if I delete all genes with an average count < 10 from the count matrix, I get only 300 DE genes. Is there a non-arbitrary way to select genes that have a reasonable minimum count?

Thanks!

Laia

@ryan-c-thompson-5618
I generally look at a histogram of average logCPM values. Typically this is a bimodal distribution, with a low-CPM peak representing non-expressed genes and a high-CPM peak representing expressed genes. I choose an appropriate filtering threshold between the two peaks. You can see an example in this document (see the section "Filtering non-expressed genes"): https://cdn.rawgit.com/DarwinAwardWinner/resume/master/examples/Salomon/Teaching/RNA-Seq%20Lab.html
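
A minimal sketch of this approach in base R, assuming a gene-by-sample matrix of raw counts. The counts below are simulated purely for illustration, `avg_log_cpm` is a hypothetical helper, and the prior count of 0.5 and the threshold of 1 are assumptions. In edgeR, `aveLogCPM()` computes a comparable (though not identical) quantity.

```r
set.seed(1)

# Simulated counts, for illustration only:
# 2000 non-expressed genes and 3000 expressed genes across 6 samples
n_samples <- 6
low  <- matrix(rpois(2000 * n_samples, lambda = 0.3), ncol = n_samples)
high <- matrix(rnbinom(3000 * n_samples, mu = 500, size = 2), ncol = n_samples)
counts <- rbind(low, high)
rownames(counts) <- paste0("gene", seq_len(nrow(counts)))

# Hypothetical helper: mean log2 counts-per-million for each gene.
# A small prior count avoids taking log2(0).
avg_log_cpm <- function(counts, prior = 0.5) {
  lib_size <- colSums(counts)
  log_cpm <- log2(t((t(counts) + prior) / (lib_size + 1) * 1e6))
  rowMeans(log_cpm)
}

a <- avg_log_cpm(counts)

# Inspect the distribution and look for the trough between the two modes
hist(a, breaks = 100, xlab = "Average log2(CPM)",
     main = "Average logCPM per gene")

# Assumed threshold, read off the histogram by eye
threshold <- 1
abline(v = threshold, col = "red", lty = 2)

# Keep only genes above the threshold
keep <- a >= threshold
filtered <- counts[keep, , drop = FALSE]
```

With real data, the threshold should be read off your own histogram rather than copied from this example.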

@mikelove
Filtering on the mean of normalized counts to obtain an optimal threshold is performed automatically in DESeq2 within the results function.

This is discussed in the vignette. Best to take a look over the vignette, as most user questions have already been addressed there:

vignette("DESeq2")
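
As a rough illustration of what this looks like in practice, the sketch below runs DESeq2 on the simulated data set from `makeExampleDESeqDataSet()` (not real data) and inspects the filter threshold that `results()` chose. The pre-filter of at least 10 counts in at least 3 samples and the `alpha` of 0.05 are assumptions for the example, not requirements.

```r
library(DESeq2)

set.seed(1)
# Simulated example data generated by DESeq2, used here only for illustration
dds <- makeExampleDESeqDataSet(n = 2000, m = 12, betaSD = 1)

# Optional, assumed pre-filter: keep genes with >= 10 counts in >= 3 samples
keep <- rowSums(counts(dds) >= 10) >= 3
dds <- dds[keep, ]

dds <- DESeq(dds)

# Independent filtering on the mean of normalized counts is on by default;
# alpha should match the FDR cutoff you intend to use
res <- results(dds, alpha = 0.05)

# Threshold on the mean normalized count chosen by the filtering step
metadata(res)$filterThreshold

# Genes removed by independent filtering have padj set to NA
summary(res)

# For comparison only: the same test with independent filtering turned off
res_nofilt <- results(dds, alpha = 0.05, independentFiltering = FALSE)
```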
Björn
@bjorn-12199
Hi Ryan, your website is really informative. However, it is not clear to me how to choose the CPM value.

As I said in my answer, I choose a threshold that lies between the low-logCPM mode and the high-logCPM mode in the logCPM histogram plot. Generally the precise choice of threshold is not important, as long as you choose one in the trough between the two modes.
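
If you want a starting point before looking at the plot, the trough can also be located programmatically. This is only a sketch and not the procedure described above, which relies on visual inspection: `find_trough` is a hypothetical helper, the average logCPM values are simulated for illustration, and it assumes the distribution really is bimodal.

```r
set.seed(1)

# Simulated average logCPM values (illustration only):
# a low mode for non-expressed genes and a high mode for expressed genes
a <- c(rnorm(2000, mean = -1, sd = 0.7), rnorm(3000, mean = 6, sd = 1.5))

# Hypothetical helper: find the density minimum between the two tallest peaks
find_trough <- function(x) {
  d <- density(x)
  # Indices of local maxima of the estimated density curve
  peaks <- which(diff(sign(diff(d$y))) == -2) + 1
  stopifnot(length(peaks) >= 2)
  # Keep the two tallest peaks, ordered from left to right
  top2 <- sort(peaks[order(d$y[peaks], decreasing = TRUE)[1:2]])
  between <- top2[1]:top2[2]
  # Position of the lowest density value between those peaks
  d$x[between[which.min(d$y[between])]]
}

threshold <- find_trough(a)

# Check the suggested threshold against the histogram before using it
hist(a, breaks = 100, xlab = "Average log2(CPM)",
     main = "Average logCPM per gene")
abline(v = threshold, col = "red", lty = 2)
```

Any value this returns should still be checked against the histogram, since a small shift within the trough makes little difference to which genes are kept.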