Search
Question: How does edgeR cpm function calculate log(CPM) values?
1
4 months ago by
jxb.dev10
jxb.dev10 wrote:

Hello,

I am knew to R and RNA-seq analysis and I am trying to understand how the cpm function in the edgeR package calculates log2(cpm).

I have a count matrix in a DGEList object and I calculated the counts per million (CPM) and log2(CPM) as follow:

> CPM <- cpm(x)
> logCPM <- cpm(x, log=TRUE, prior.count = 1)

The CPM matrix looks as expected but the logCPM returns negative values, and with a prior.count=1.0 I would expect only positive values.
Indeed if I compute log2(CPM+1), I get completely different values. So what is cpm(x, log=TRUE) returning?

-------------------------------------------------------------- output of sessionInfo() --------------------------------------------------------

> sessionInfo()

R version 3.4.4 (2018-03-15)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] edgeR_3.20.9
[2] limma_3.34.9
[3] Homo.sapiens_1.3.1
[4] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
[5] org.Hs.eg.db_3.5.0
[6] GO.db_3.5.0
[7] OrganismDbi_1.20.0
[8] GenomicFeatures_1.30.3
[9] GenomicRanges_1.30.3
[10] GenomeInfoDb_1.14.0
[11] AnnotationDbi_1.40.0
[12] IRanges_2.12.0
[13] S4Vectors_0.16.0
[14] Biobase_2.38.0
[15] BiocGenerics_0.24.0

loaded via a namespace (and not attached):
[1] SummarizedExperiment_1.8.1 pbdZMQ_0.3-2
[3] progress_1.1.2             locfit_1.5-9.1
[5] repr_0.12.0                lattice_0.20-35
[7] rtracklayer_1.38.3         blob_1.1.1
[9] XML_3.98-1.10              RBGL_1.54.0
[11] DBI_0.8                    BiocParallel_1.12.0
[13] bit64_0.9-7                uuid_0.1-2
[15] matrixStats_0.53.1         GenomeInfoDbData_1.0.0
[17] stringr_1.3.0              zlibbioc_1.24.0
[19] Biostrings_2.46.0          memoise_1.1.0
[21] evaluate_0.10.1            biomaRt_2.34.2
[23] BiocInstaller_1.28.0       IRdisplay_0.4.4
[25] Rcpp_0.12.16               IRkernel_0.8.12.9000
[27] DelayedArray_0.4.1         graph_1.56.0
[29] jsonlite_1.5               XVector_0.18.0
[31] bit_1.1-12                 Rsamtools_1.30.0
[33] RMySQL_0.10.14             digest_0.6.15
[35] stringi_1.1.7              grid_3.4.4
[37] tools_3.4.4                bitops_1.0-6
[39] magrittr_1.5               RCurl_1.95-4.10
[41] RSQLite_2.1.0              pkgconfig_2.0.1
[43] crayon_1.3.4               Matrix_1.2-13
[45] prettyunits_1.0.2          assertthat_0.2.0
[47] httr_1.3.1                 R6_2.2.2
[49] GenomicAlignments_1.14.2   compiler_3.4.4 

modified 8 weeks ago by ftroot10100 • written 4 months ago by jxb.dev10
2
4 months ago by
United States
James W. MacDonald47k wrote:

The cpm function computes log2 counts/million, as stated in the help page for the function. But remember that you are computing counts/million counts. If you start with 0 counts, bump that up to 1 with a prior count of 1, and then divide by the total counts for that sample (in millions), unless you have less than a million reads for that sample, you should expect a fractional value.

2

To add to James' comment, see posts from Gordon about the scaling of the prior count within cpm():

Differences between limma voom E values and edgeR cpm values?

I should also mention that computing log2(CPM+1) is, in my opinion, a Bad Idea. Consider an example where I get a CPM of 0.1 in library A and a CPM of 0.2 in library B. This is clearly a 2-fold change between libraries, but after computing the log-CPMs with the above naive formula, I would get values of 0.13 and 0.26 in A and B respectively. Clearly, this result is wrong - on the log-scale, I should see a difference of 1 log unit between these two libraries.

"Well, that's fine," you might say, "because a CPM of 0.1 is pretty low, and the +1 will shrink it towards zero prior to log-transformation, thus stabilizing any log-fold changes between libraries in the presence of limited information." But is a CPM of 0.1 really that low? If my libraries contained 100 million reads, a CPM of 0.1 and 0.2 would correspond to counts of 10 and 20, respectively. That's more than enough information to tell me that I have a 2-fold change between A and B. Indeed, cpm() will respect this and report log-CPMs that do, in fact, differ by ~1 log unit:

cpm(cbind(10, 20), lib.size=rep(100e6, 2), prior.count=1, log=TRUE)
##           [,1]      [,2]
## [1,] -3.184425 -2.251539

In short, it makes more sense to apply the prior count on the scale of, well, the counts, rather than the CPMs.

ADD REPLYlink modified 4 months ago • written 4 months ago by Aaron Lun20k
0
8 weeks ago by
ftroot10100 wrote:

Hi,

I had the same question, and I didn't find the the math definition of cpm function in edgeR. According to the details of the function, "If log-values are computed, then a small count, given by prior.count but scaled to be proportional to the library size, is added to y to avoid taking the log of zero." So I tried many possibilities and finally got something maybe right. The codes and detail are showed below.

I used TMM for normalization, while you can ignore that and keep normalization factors (x$samples$norm.factors) = 1. I keep that for answering the problem a little more generally.

> x = calcNormFactors(x, method='TMM') # Just skip this line if don't need to normalization by TMM

Calculate the cpm with two different parameters.

> prior.count = 1
> cpm_with_log2 = cpm(x, log=T, prior.count=prior.count)
> cpm_without_log2 = cpm(x, log=F)

Calculate the normalized library sizes.

> nlib = x$samples$lib.size * x$samples$norm.factors

So the cpm_without_log2 is same as

> cpm_without_log2_by_hand = t( t(x$counts) / nlib * 1e6 ) > cpm_without_log2[1:5, 1:5]; cpm_without_log2_by_hand[1:5, 1:5] # Take a look and compare And I found that the cpm_with_log2 can be calculated from > prior.count.scaled = prior.count * length(nlib) * nlib / sum(nlib) > cpm_with_log2_by_hand = t(log2( (t(tempData$counts)+prior.count.scaled) / (nlib +  2*prior.count.scaled) * 1e+06 ))
> cpm_with_log2[1:5, 1:5]; cpm_with_log2_by_hand[1:5, 1:5] # Take a look and compare

By observing the different between cpm_without_log2_by_hand and cpm_with_log2_by_hand, we can transfer cpm_without_log2 to cpm_with_log2 by

> cpm_with_log2_two_step = t(log2( (t(cpm_without_log2/1e6) * nlib + prior.count.scaled) / (nlib+2*prior.count.scaled) *1e6 ))
> cpm_with_log2[1:5, 1:5]; cpm_with_log2_two_step [1:5, 1:5] # Take a look and compare

Notice that if you compare these two matrix with

> all( cpm_with_log2_two_step == cpm_with_log2 )

It must get FALSE. While I consider it as round-off error, because of the little bias:

> sum( abs(cpm_with_log2_two_step - cpm_with_log2 ))

While the calculation above is just my guess, so maybe it's wrong.

Perhaps the documentation for addPriorCount may be informative regarding the prior count addition strategy.