Pauly Lin ▴ 120
University of New South Wales, Australia

Dear all, 

I have two questions on the filtering of genes with low counts in differential expression analysis using edgeR:

1. I know that RPKM (or FPKM) values are not suitable for differential expression analysis, but is it also problematic to use RPKM values for filtering, i.e. eliminate genes with low RPKM values and then use the raw counts of the remaining genes for edgeR differential expression analysis?

2. The edgeR manual recommends CPM (or TPM) for filtering out genes with low counts and TMM normalisation for the actual differential expression analysis. Why not use the same normalisation for both purposes? For example, is it OK to use TMM normalisation both for filtering out genes with low counts and for the differential expression analysis?



edger rnaseq differential expression • 26k views
Aaron Lun ★ 27k
The city by the bay
  1. Filtering on RPKMs seems inappropriate, precisely because RPKM accounts for gene length. Consider a very long gene that is expressed at a moderate level. Because of its length, the RPKM for this gene will generally be low, and the gene would be removed by filtering at some RPKM threshold. However, the absolute counts for this gene will (probably) be large, which means there is plenty of information for dispersion estimation and for DE testing. Removing this gene by RPKM filtering would not be desirable.
  2. Your question isn't entirely clear. calcNormFactors, as the name suggests, just computes normalization factors from the TMM method. There's no filtering of genes here - at least, not at any level that's accessible by the user. In any case, filtering should be done before using calcNormFactors, to remove low-abundance genes with unreliable M-values. I don't think it's necessary to repeat the filtering step after normalization, even if the effective library sizes have changed in the CPM calculations.
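
To make the first point concrete, here is a minimal sketch in plain Python (not edgeR code). All numbers are made up for illustration: the library size, the two genes, their lengths and counts, and the RPKM cutoff of 0.3 are assumptions, not values from the edgeR manual.

```python
# Hypothetical library of 20 million mapped reads.
LIBRARY_SIZE = 20_000_000

# (read count, gene length in bases) for two made-up genes.
genes = {
    "long_gene": (500, 100_000),  # many reads, very long gene
    "short_gene": (10, 1_000),    # few reads, short gene
}

def cpm(count, library_size):
    """Counts per million mapped reads."""
    return count / (library_size / 1e6)

def rpkm(count, length_bp, library_size):
    """Reads per kilobase of gene per million mapped reads."""
    return cpm(count, library_size) / (length_bp / 1e3)

RPKM_CUTOFF = 0.3  # arbitrary threshold, chosen only for this illustration

for name, (count, length_bp) in genes.items():
    value = rpkm(count, length_bp, LIBRARY_SIZE)
    kept = value >= RPKM_CUTOFF
    print(f"{name}: count={count}, RPKM={value:.2f}, kept={kept}")
```

With these numbers the long gene, which has 500 reads, falls below the cutoff and is discarded, while the short gene with only 10 reads is retained. That is the opposite of what a low-count filter is meant to achieve.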

Thanks for the prompt response, Aaron. I'm very clear about RPKM now. With regard to my second question, I was referring to the linked edgeR manual below. In section 4.3.6, CPM normalisation is used for removing low abundance genes, and then in section 4.3.7 the raw counts of the remaining genes go through TMM normalisation before differential expression analysis. I was wondering why the two steps involve two different types of normalisations rather than just one. 


By the way, is there a good rule of thumb for choosing a CPM threshold for removing low-abundance genes?


The two steps refer to different aspects of normalization. CPM "normalization" accounts for library size differences between samples, and produces normalized values that can be compared on an absolute scale (e.g., for filtering). TMM normalization accounts for composition bias, and computes normalization factors for comparing between libraries on a relative scale. CPM normalization doesn't account for composition bias, and TMM normalization doesn't produce normalized values. Thus, you need both steps in the analysis pipeline. This isn't a problem, as the two steps aren't really redundant.
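
A toy example may help show why CPM alone does not handle composition bias. The sketch below is plain Python with made-up counts. It uses a simple median-of-ratios scaling factor as a stand-in for TMM; it is not the TMM algorithm that calcNormFactors implements, only an illustration of the same idea. The assumption is that genes g1 to g3 are truly unchanged, while g4 is strongly up in sample B and soaks up most of the reads there.

```python
from statistics import median

# Both samples sequenced to the same (tiny, hypothetical) depth.
LIBRARY_SIZE = 1000

# Made-up counts: g4 dominates sample B, so g1-g3 receive fewer reads
# even though their true expression is assumed to be unchanged.
counts_a = {"g1": 100, "g2": 200, "g3": 300, "g4": 400}
counts_b = {"g1": 20, "g2": 40, "g3": 60, "g4": 880}

def cpm(counts, library_size):
    """Counts per million for every gene in one sample."""
    return {gene: c * 1e6 / library_size for gene, c in counts.items()}

cpm_a = cpm(counts_a, LIBRARY_SIZE)
cpm_b = cpm(counts_b, LIBRARY_SIZE)

# Plain CPM makes the unchanged genes look 5-fold lower in sample B.
ratios = {gene: cpm_b[gene] / cpm_a[gene] for gene in counts_a}
print("CPM ratios B/A:", ratios)

# Simplified scaling factor (median of ratios, NOT TMM): most genes are
# assumed not to be DE, so the typical ratio estimates the bias.
scale_b = median(ratios.values())

# The factor rescales the library size; it does not itself yield
# normalized expression values.
effective_size_b = LIBRARY_SIZE * scale_b
adjusted_cpm_b = cpm(counts_b, effective_size_b)
print("Adjusted CPM for g1 in B:", adjusted_cpm_b["g1"], "vs A:", cpm_a["g1"])
```

Once the effective library size is used, g1 to g3 line up between the two samples and only g4 remains different. The CPM calculation is still what produces the comparable values; the scaling factor only corrects the library size that goes into it.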

As for your second question: in general, we pick the CPM threshold to get rid of genes with low counts across all samples. The exact definition of "low" will vary between analyses. We think that an absolute count of 5 to 10 is pretty low; for such genes, there won't be enough evidence to reject the null hypothesis in most data sets with limited sample sizes. Of course, the corresponding CPM threshold will depend on the size of your libraries, but it is trivial to compute: divide the count by the library size in millions, so a count of 10 in a library of 20 million reads corresponds to a CPM of 0.5.
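
The conversion from a count to a CPM threshold can be sketched as follows. This is plain Python rather than edgeR code, and several things are assumptions made for illustration: the library sizes and counts are invented, the smallest library is used as the reference, and `min_samples` is a hypothetical parameter for how many samples must pass the cutoff.

```python
def cpm_cutoff(min_count, library_size):
    """CPM value that corresponds to `min_count` reads in one library."""
    return min_count / (library_size / 1e6)

# Hypothetical library sizes (total mapped reads) for three samples.
library_sizes = [18_000_000, 20_000_000, 25_000_000]

MIN_COUNT = 10  # "an absolute count of 5 to 10 is pretty low"

# Assumption: use the smallest library as the reference size.
cutoff = cpm_cutoff(MIN_COUNT, min(library_sizes))

# Made-up raw counts per gene, one value per sample.
counts = {
    "gene_a": [2, 3, 1],
    "gene_b": [30, 45, 50],
    "gene_c": [0, 4, 40],
}

def keep_gene(gene_counts, sizes, cutoff, min_samples=1):
    """Keep a gene if its CPM reaches the cutoff in at least `min_samples` samples."""
    cpms = [c / (size / 1e6) for c, size in zip(gene_counts, sizes)]
    return sum(value >= cutoff for value in cpms) >= min_samples

kept = [gene for gene, c in counts.items() if keep_gene(c, library_sizes, cutoff)]
print(f"CPM cutoff: {cutoff:.3f}")
print("Genes kept:", kept)
```

With `min_samples=1`, a gene is dropped only when it is low in every sample, which matches the description above. Requiring more samples to pass is a stricter choice that depends on the design of the experiment. The raw counts of the retained genes, not the CPM values, are what go on to normalization and DE testing.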


Thanks, Aaron! It's very helpful!


