Question

cpm filtering edgeR

es874 ▴ 20

I have read the edgeR User's Guide, and Section 2.6 recommends requiring at least 5-10 raw counts in a library for a gene to be considered expressed. If I set 10 counts as my threshold for expression, the corresponding CPM comes out at about 0.8, based on the smallest library size in my sample set:

     group lib.size norm.factors
SM01    T0 12559041            1

0.8 ≈ (10 × 1,000,000) / 12,559,041

Is this correct? Am I being too stringent using 0.8 instead of 1?

Also, in what situations would the recommended 5-10 counts go up or down?

Thanks

edger • 1.2k views

updated 8.9 years ago by Aaron Lun ★ 29k • written 9.0 years ago by es874 ▴ 20

score 2 · Accepted Answer · 2016-12-12

First, your calculation looks fine to me. You could probably just use a CPM threshold of 1 instead; it's not that much of a difference. But if you've taken the time to work it out, you might as well use 0.8.
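To make the arithmetic concrete, here is a minimal sketch of the conversion in both directions. It is plain Python, not edgeR code; the helper names `count_to_cpm` and `cpm_to_count` are made up for illustration, and the only real number is the SM01 library size from the question.

```python
def count_to_cpm(count, lib_size):
    """Convert a raw read count to counts per million for one library."""
    return count * 1_000_000 / lib_size


def cpm_to_count(cpm, lib_size):
    """Convert a CPM value back to the raw count it implies in one library."""
    return cpm * lib_size / 1_000_000


# Library size of SM01, the smallest library quoted in the question.
smallest_lib = 12_559_041

# CPM equivalent of a 10-count threshold in the smallest library.
cpm_for_10_counts = count_to_cpm(10, smallest_lib)

# Raw count that a CPM threshold of 1 implies in the same library.
counts_for_cpm_1 = cpm_to_count(1.0, smallest_lib)

print(round(cpm_for_10_counts, 3))  # 0.796
print(round(counts_for_cpm_1, 2))   # 12.56
```

So in this library a CPM cutoff of 1 asks for about 12.6 reads rather than 10, which is why the two choices rarely give very different results.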

The 5-10 recommendation seems to do well in a variety of situations (for routine RNA-seq, at least). The issue is that, at counts lower than 5, we get problems with discreteness and some statistical approximations become inaccurate. On the other hand, we don't want to increase the filter beyond 10, because we might start filtering out interesting genes. So the 5-10 choice represents a compromise between these two considerations.
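As a rough illustration of what applying such a filter looks like, the sketch below keeps a gene when its CPM exceeds the threshold in enough libraries. Everything here is an assumption for the sake of the example: the gene names and counts are invented, only the first library size comes from the question, and `min_libs` is a hypothetical parameter rather than anything stated in this thread. It is plain Python, not the edgeR implementation.

```python
def cpm(counts, lib_sizes):
    """Counts per million for one gene across all libraries."""
    return [c * 1_000_000 / n for c, n in zip(counts, lib_sizes)]


def keep_gene(counts, lib_sizes, cpm_threshold, min_libs):
    """True if the gene's CPM is above the threshold in at least min_libs libraries."""
    above = sum(1 for value in cpm(counts, lib_sizes) if value > cpm_threshold)
    return above >= min_libs


# First size is SM01 from the question; the other two are hypothetical.
lib_sizes = [12_559_041, 15_000_000, 20_000_000]

# Invented counts: one row per gene, one column per library.
genes = {
    "geneA": [12, 20, 31],  # above 0.8 CPM in all three libraries
    "geneB": [3, 2, 40],    # above 0.8 CPM in only one library
    "geneC": [0, 1, 0],     # essentially unexpressed
}

kept = [name for name, counts in genes.items()
        if keep_gene(counts, lib_sizes, cpm_threshold=0.8, min_libs=2)]
print(kept)  # ['geneA']
```

Raising `cpm_threshold` trades away low-abundance genes for better-behaved counts; lowering it does the reverse.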

Lower thresholds are sometimes used when you're explicitly interested in low-abundance genes, e.g., certain ncRNAs or repeat elements that don't get a lot of reads. Higher thresholds are used in other applications like ChIP-seq, where there is a certain level of background enrichment and we need to set the filter above that background. This gets rid of uninteresting genomic regions, even if they have large read counts.