I have two questions on the filtering of genes with low counts in differential expression analysis using edgeR:
1. I know that RPKM (or FPKM) values are not suitable for differential expression analysis, but is it also problematic to use RPKM values for filtering, i.e. to eliminate genes with low RPKM values and then use the raw counts of the remaining genes for the edgeR differential expression analysis?
2. The edgeR manual recommends CPM (or TPM) for filtering out genes with low counts and TMM normalisation for the actual differential expression analysis. Why not use the same normalisation for both purposes? For example, is it OK to use TMM normalisation both for filtering out genes with low counts and for the differential expression analysis?
Filtering on RPKMs seems inappropriate, precisely because RPKM adjusts for gene length. Consider a very long gene that is expressed at a moderate level. Because its count is divided by its length, the RPKM for this gene will generally be low, and the gene would be removed by filtering at some RPKM threshold. However, the absolute size of the counts for this gene will (probably) be large, which means that there's plenty of information for dispersion estimation and for DE testing. Removing this gene by RPKM filtering would not be desirable.
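To make the arithmetic concrete, here is a small sketch in plain Python (not edgeR code). The library size, gene lengths, counts, and both cutoffs are made-up numbers chosen purely for illustration, and the gene names are hypothetical.

```python
def cpm(count, lib_size):
    """Counts per million: scales a raw count by library size only."""
    return count * 1e6 / lib_size


def rpkm(count, lib_size, gene_length_bp):
    """Reads per kilobase per million: CPM divided by gene length in kb."""
    return cpm(count, lib_size) / (gene_length_bp / 1e3)


# All values below are hypothetical, for illustration only.
LIB_SIZE = 20_000_000   # total reads in the library
RPKM_CUTOFF = 0.5       # arbitrary RPKM filtering threshold
CPM_CUTOFF = 1.0        # arbitrary CPM filtering threshold

GENES = {
    "long_gene": {"count": 400, "length_bp": 100_000},
    "short_gene": {"count": 8, "length_bp": 500},
}

results = {}
for name, gene in GENES.items():
    gene_cpm = cpm(gene["count"], LIB_SIZE)
    gene_rpkm = rpkm(gene["count"], LIB_SIZE, gene["length_bp"])
    results[name] = {
        "keep_by_rpkm": gene_rpkm >= RPKM_CUTOFF,
        "keep_by_cpm": gene_cpm >= CPM_CUTOFF,
    }
    print(f"{name}: count={gene['count']}, CPM={gene_cpm:.2f}, "
          f"RPKM={gene_rpkm:.2f}, {results[name]}")
```

With these invented numbers, the long gene has 400 reads but fails the RPKM cutoff, while the short gene with only 8 reads passes it. A CPM cutoff makes the opposite call in both cases, because it looks only at the count relative to library size.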
Your question isn't entirely clear. calcNormFactors, as the name suggests, just computes normalization factors using the TMM method. It does no filtering of genes, at least not at any level that's accessible to the user. In any case, filtering should be done before calling calcNormFactors, so that low-abundance genes with unreliable M-values are removed first. I don't think it's necessary to repeat the filtering step after normalization, even though the effective library sizes used in the CPM calculations will have changed.
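As a rough illustration of why re-filtering is usually unnecessary, the sketch below (plain Python, not edgeR code) computes CPMs with and without a normalization factor. It assumes the effective library size is the raw library size multiplied by the normalization factor. The counts, library size, factor of 0.9, and cutoff are all made-up values, and the gene names are hypothetical.

```python
def cpm(count, lib_size, norm_factor=1.0):
    """CPM using an effective library size of lib_size * norm_factor."""
    effective_lib_size = lib_size * norm_factor
    return count * 1e6 / effective_lib_size


# All values below are hypothetical, for illustration only.
COUNTS = {"gene_a": 50, "gene_b": 12, "gene_c": 3}
LIB_SIZE = 10_000_000
NORM_FACTOR = 0.9   # stand-in for a TMM normalization factor
CPM_CUTOFF = 1.0    # arbitrary filtering threshold

# Filtering decision on CPMs from the raw library size.
keep_before = {g: cpm(c, LIB_SIZE) >= CPM_CUTOFF for g, c in COUNTS.items()}

# Same decision, recomputed with the effective library size.
keep_after = {
    g: cpm(c, LIB_SIZE, NORM_FACTOR) >= CPM_CUTOFF for g, c in COUNTS.items()
}

for gene in COUNTS:
    print(gene, keep_before[gene], keep_after[gene])
```

In this invented example the normalization factor shifts every CPM by about 11%, and none of the filtering decisions change. Only a gene sitting very close to the cutoff could flip, which is consistent with not repeating the filter after normalization.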