Question: How does edgeR cpm function calculate log(CPM) values?
gravatar for
4 months ago by
jxb.dev10 wrote:


I am knew to R and RNA-seq analysis and I am trying to understand how the cpm function in the edgeR package calculates log2(cpm).

I have a count matrix in a DGEList object and I calculated the counts per million (CPM) and log2(CPM) as follow:

> CPM <- cpm(x)
> logCPM <- cpm(x, log=TRUE, prior.count = 1)

The CPM matrix looks as expected but the logCPM returns negative values, and with a prior.count=1.0 I would expect only positive values.
Indeed if I compute log2(CPM+1), I get completely different values. So what is cpm(x, log=TRUE) returning?

-------------------------------------------------------------- output of sessionInfo() --------------------------------------------------------

> sessionInfo()

R version 3.4.4 (2018-03-15)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] edgeR_3.20.9                           
 [2] limma_3.34.9                           
 [3] Homo.sapiens_1.3.1                     
 [4] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
 [6] GO.db_3.5.0                            
 [7] OrganismDbi_1.20.0                     
 [8] GenomicFeatures_1.30.3                 
 [9] GenomicRanges_1.30.3                   
[10] GenomeInfoDb_1.14.0                    
[11] AnnotationDbi_1.40.0                   
[12] IRanges_2.12.0                         
[13] S4Vectors_0.16.0                       
[14] Biobase_2.38.0                         
[15] BiocGenerics_0.24.0                    

loaded via a namespace (and not attached):
 [1] SummarizedExperiment_1.8.1 pbdZMQ_0.3-2              
 [3] progress_1.1.2             locfit_1.5-9.1            
 [5] repr_0.12.0                lattice_0.20-35           
 [7] rtracklayer_1.38.3         blob_1.1.1                
 [9] XML_3.98-1.10              RBGL_1.54.0               
[11] DBI_0.8                    BiocParallel_1.12.0       
[13] bit64_0.9-7                uuid_0.1-2                
[15] matrixStats_0.53.1         GenomeInfoDbData_1.0.0    
[17] stringr_1.3.0              zlibbioc_1.24.0           
[19] Biostrings_2.46.0          memoise_1.1.0             
[21] evaluate_0.10.1            biomaRt_2.34.2            
[23] BiocInstaller_1.28.0       IRdisplay_0.4.4           
[25] Rcpp_0.12.16               IRkernel_0.8.12.9000      
[27] DelayedArray_0.4.1         graph_1.56.0              
[29] jsonlite_1.5               XVector_0.18.0            
[31] bit_1.1-12                 Rsamtools_1.30.0          
[33] RMySQL_0.10.14             digest_0.6.15             
[35] stringi_1.1.7              grid_3.4.4                
[37] tools_3.4.4                bitops_1.0-6              
[39] magrittr_1.5               RCurl_1.95-4.10           
[41] RSQLite_2.1.0              pkgconfig_2.0.1           
[43] crayon_1.3.4               Matrix_1.2-13             
[45] prettyunits_1.0.2          assertthat_0.2.0          
[47] httr_1.3.1                 R6_2.2.2                  
[49] GenomicAlignments_1.14.2   compiler_3.4.4 


ADD COMMENTlink modified 8 weeks ago by ftroot10100 • written 4 months ago by jxb.dev10
gravatar for James W. MacDonald
4 months ago by
United States
James W. MacDonald47k wrote:

The cpm function computes log2 counts/million, as stated in the help page for the function. But remember that you are computing counts/million counts. If you start with 0 counts, bump that up to 1 with a prior count of 1, and then divide by the total counts for that sample (in millions), unless you have less than a million reads for that sample, you should expect a fractional value. 

ADD COMMENTlink written 4 months ago by James W. MacDonald47k

To add to James' comment, see posts from Gordon about the scaling of the prior count within cpm():

Differences between limma voom E values and edgeR cpm values?

I should also mention that computing log2(CPM+1) is, in my opinion, a Bad Idea. Consider an example where I get a CPM of 0.1 in library A and a CPM of 0.2 in library B. This is clearly a 2-fold change between libraries, but after computing the log-CPMs with the above naive formula, I would get values of 0.13 and 0.26 in A and B respectively. Clearly, this result is wrong - on the log-scale, I should see a difference of 1 log unit between these two libraries.

"Well, that's fine," you might say, "because a CPM of 0.1 is pretty low, and the +1 will shrink it towards zero prior to log-transformation, thus stabilizing any log-fold changes between libraries in the presence of limited information." But is a CPM of 0.1 really that low? If my libraries contained 100 million reads, a CPM of 0.1 and 0.2 would correspond to counts of 10 and 20, respectively. That's more than enough information to tell me that I have a 2-fold change between A and B. Indeed, cpm() will respect this and report log-CPMs that do, in fact, differ by ~1 log unit:

cpm(cbind(10, 20), lib.size=rep(100e6, 2), prior.count=1, log=TRUE)
##           [,1]      [,2]
## [1,] -3.184425 -2.251539

In short, it makes more sense to apply the prior count on the scale of, well, the counts, rather than the CPMs.

ADD REPLYlink modified 4 months ago • written 4 months ago by Aaron Lun20k
gravatar for ftroot1010
8 weeks ago by
ftroot10100 wrote:


I had the same question, and I didn't find the the math definition of cpm function in edgeR. According to the details of the function, "If log-values are computed, then a small count, given by prior.count but scaled to be proportional to the library size, is added to y to avoid taking the log of zero." So I tried many possibilities and finally got something maybe right. The codes and detail are showed below.

I used TMM for normalization, while you can ignore that and keep normalization factors (x$samples$norm.factors) = 1. I keep that for answering the problem a little more generally.

> x = calcNormFactors(x, method='TMM') # Just skip this line if don't need to normalization by TMM

Calculate the cpm with two different parameters.

> prior.count = 1
> cpm_with_log2 = cpm(x, log=T, prior.count=prior.count)
> cpm_without_log2 = cpm(x, log=F)

Calculate the normalized library sizes.

> nlib = x$samples$lib.size * x$samples$norm.factors

So the cpm_without_log2 is same as

> cpm_without_log2_by_hand = t( t(x$counts) / nlib * 1e6 )
> cpm_without_log2[1:5, 1:5]; cpm_without_log2_by_hand[1:5, 1:5] # Take a look and compare

And I found that the cpm_with_log2 can be calculated from

> prior.count.scaled = prior.count * length(nlib) * nlib / sum(nlib)
> cpm_with_log2_by_hand = t(log2( (t(tempData$counts)+prior.count.scaled) / (nlib +  2*prior.count.scaled) * 1e+06 ))
> cpm_with_log2[1:5, 1:5]; cpm_with_log2_by_hand[1:5, 1:5] # Take a look and compare

By observing the different between cpm_without_log2_by_hand and cpm_with_log2_by_hand, we can transfer cpm_without_log2 to cpm_with_log2 by

> cpm_with_log2_two_step = t(log2( (t(cpm_without_log2/1e6) * nlib + prior.count.scaled) / (nlib+2*prior.count.scaled) *1e6 ))
> cpm_with_log2[1:5, 1:5]; cpm_with_log2_two_step [1:5, 1:5] # Take a look and compare

Notice that if you compare these two matrix with

> all( cpm_with_log2_two_step == cpm_with_log2 )

It must get FALSE. While I consider it as round-off error, because of the little bias:

> sum( abs(cpm_with_log2_two_step - cpm_with_log2 ))


While the calculation above is just my guess, so maybe it's wrong.

ADD COMMENTlink modified 8 weeks ago • written 8 weeks ago by ftroot10100

Perhaps the documentation for addPriorCount may be informative regarding the prior count addition strategy.

ADD REPLYlink written 8 weeks ago by Aaron Lun20k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 123 users visited in the last hour