How many 'expressed genes' do I have in my dataset?


I'm trying to assess the significance of the overlap between groups of genes.

To do this I need the number of genes 'expressed' in my samples. I've read that people use an FPKM value > 1 as a rough cutoff for this, which gives me 10,000-12,000 genes depending on the sample.

Does anyone know an equivalent cutoff using the baseMean output from DESeq2? I'm trying to keep the number of 'expressed genes' consistent with the analysis it derives from.

Thanks very much!


deseq2 cuffdiff • 1.5k views

Hi Alex,

DESeq2 does have an fpkm() function, which works automatically if you've used summarizeOverlaps to construct the counts, or if you add the gene length information yourself (see ?fpkm). It divides the normalized counts for each gene by the total length of the union of exonic base pairs that were used for counting. This matters because normalized counts are proportional to gene length as well as to gene expression (and other factors), so dividing out the gene length gets you closer to something like expression.
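To make the length scaling concrete, here is a minimal Python sketch of the textbook FPKM arithmetic for a single sample. It is not DESeq2's implementation: it scales by the raw column sum, whereas ?fpkm describes an option to use size factors instead. The gene names, counts and lengths are made up for illustration.

```python
def fpkm(counts, lengths_bp):
    """Fragments per kilobase of exon per million counted fragments, one sample.

    counts     -- dict of gene -> fragment count (hypothetical toy input)
    lengths_bp -- dict of gene -> length in bp of the union of its exons
    """
    millions = sum(counts.values()) / 1e6  # library size in millions
    return {
        gene: n / (lengths_bp[gene] / 1e3) / millions
        for gene, n in counts.items()
    }


# Made-up numbers, chosen so the library size is exactly one million.
counts = {"geneA": 500, "geneB": 500, "geneC": 1, "geneD": 998999}
lengths_bp = {"geneA": 1000, "geneB": 5000, "geneC": 2000, "geneD": 2000}

values = fpkm(counts, lengths_bp)

# Apply the rough FPKM > 1 cutoff mentioned in the question.
expressed = sorted(g for g, v in values.items() if v > 1)

print(values)
print(expressed)
```

geneA and geneB have the same count, but geneB is five times longer, so its FPKM comes out five times lower. That is the length effect being divided out.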

You can also consider Bioconductor packages like cqn or EDASeq, which additionally correct for sample-specific gene length and GC-content curves. See the vignettes of those packages for details.

I don't have a recommendation for a generic expressed/not expressed cutoff, though.


I would add that no generic "expressed/not expressed" cutoff exists for all experiments, because compositional biases mean that the appropriate threshold differs from dataset to dataset. This is the same reason that you need to estimate size factors in order to compare between samples.
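A toy illustration of the composition effect, as a hedged Python sketch with made-up counts: three genes have identical counts in two samples, while a fourth is far higher in sample B. Per-million scaling makes the unchanged genes look ten times lower in B, whereas a median-of-ratios estimate (the general idea behind size factors, written from scratch here rather than taken from DESeq2) treats the two samples as comparable.

```python
import math
from statistics import median


def per_million(counts):
    """Scale one sample's counts to counts per million of its own total."""
    total = sum(counts)
    return [c * 1e6 / total for c in counts]


def size_factors(samples):
    """Median-of-ratios size factors, one per sample (illustrative sketch)."""
    n_genes = len(samples[0])
    # Log geometric mean of each gene across samples; genes with a zero
    # count in any sample are skipped (marked None).
    log_geo_means = []
    for i in range(n_genes):
        column = [s[i] for s in samples]
        if all(c > 0 for c in column):
            log_geo_means.append(sum(math.log(c) for c in column) / len(column))
        else:
            log_geo_means.append(None)
    # Each sample's factor is the median ratio of its counts to those means.
    factors = []
    for s in samples:
        log_ratios = [
            math.log(s[i]) - m
            for i, m in enumerate(log_geo_means)
            if m is not None
        ]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Hypothetical counts: genes 1-3 unchanged, gene 4 much higher in sample B.
sample_a = [100, 200, 300, 400]
sample_b = [100, 200, 300, 9400]

cpm_a = per_million(sample_a)
cpm_b = per_million(sample_b)
factors = size_factors([sample_a, sample_b])

print(cpm_a[0], cpm_b[0])  # same raw count, very different per-million value
print(factors)
```

Because the per-million value of an unchanged gene depends on what else is in the library, a fixed FPKM-style threshold does not mean the same thing in every sample or dataset.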

