Why edgeR::aveLogCPM() is simlar to log2(rowMeans(cpm)) instead of rowMeans(cpm(y, log=TRUE))
1
1
Entering edit mode
Robert Castelo ★ 2.7k
@rcastelo
Last seen 4 weeks ago
Barcelona/Universitat Pompeu Fabra

hi,

the documentation of the aveLogCPM() function from edgeR says that this function is similar to:

log2(rowMeans(cpm(y, ...)))

because CPM values may be more disperse than logCPM values and the arithmetic mean is sensitive to outliers, would not make more sense that aveLogCPM() resembles rowMeans(cpm(y, log=TRUE, ...))) instead?

In the old microarray days one would take the average of the log-fluorescent units to look at average expression, so I had expected that aveLogCPM() would also take the mean value over the logarithmic scale. I guess there's a good reason for this but could not figure it out from the documentation.

thanks!!

robert.

edgeR rnaseq limma independent filtering • 794 views
5
Entering edit mode
Aaron Lun ★ 27k
@alun
Last seen 1 hour ago
The city by the bay

No, because the aveLogCPM function is based on the mean for each gene under a negative binomial model, which naturally considers counts on the raw scale. Specifically, the count for each sample i is assumed to be sampled from a negative binomial distribution with mean equal to x*lib_i where lib_i is the effective library size for the sample and x is a gene-specific value representing the ratio of the average expression to the library size. This is multiplied by one million to get the average CPM, which is then log-transformed.

As a result, the output of aveLogCPM is closer to

log2(rowMeans(cpm(y))


... where the log-transformation is done after the averaging, than to

rowMeans(cpm(y, log=TRUE))


... where the log-transformation is done before the averaging. It makes sense to compute the average on the count scale first, given that the rest of the edgeR analysis is based on the negative binomial model with the original counts. Indeed, the use of the NB mean is quite important for independent filtering, based on empirical performance of mean-based filters and analogy to normal models.

0
Entering edit mode

thanks! does the use of the NB mean for filtering lowly-expressed genes still applies if, instead of edgeR, i use the limma-voom pipeline for differential expression?

2
Entering edit mode

If you're fitting a linear model to log-expression values, the independent filter statistic would technically be the mean log-expression across samples. However, I wouldn't worry about it; indeed, the edgeR user's guide describes different filtering strategies altogether. This is because the choice of filter statistic doesn't matter much for RNA-seq, as the density of genes at the filter boundary is low (i.e., different filtering strategies yield similar sets of retained genes). By comparison, filtering requires more care in applications such as ChIP-seq, where high-abundance regions are less frequent. The choice of filter will have a much greater effect on the analysis, simply because there's more low-abundance regions that are affected by the behaviour of the filter statistic.