How does edgeR cpm function calculate log(CPM) values?
2
4
Entering edit mode
jxb.dev ▴ 40
@jxbdev-15468
Last seen 6.1 years ago

Hello,

I am knew to R and RNA-seq analysis and I am trying to understand how the cpm function in the edgeR package calculates log2(cpm).

I have a count matrix in a DGEList object and I calculated the counts per million (CPM) and log2(CPM) as follow:

> CPM <- cpm(x)
> logCPM <- cpm(x, log=TRUE, prior.count = 1)

The CPM matrix looks as expected but the logCPM returns negative values, and with a prior.count=1.0 I would expect only positive values.
Indeed if I compute log2(CPM+1), I get completely different values. So what is cpm(x, log=TRUE) returning?


-------------------------------------------------------------- output of sessionInfo() --------------------------------------------------------

> sessionInfo()

R version 3.4.4 (2018-03-15)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] edgeR_3.20.9                           
 [2] limma_3.34.9                           
 [3] Homo.sapiens_1.3.1                     
 [4] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
 [5] org.Hs.eg.db_3.5.0                     
 [6] GO.db_3.5.0                            
 [7] OrganismDbi_1.20.0                     
 [8] GenomicFeatures_1.30.3                 
 [9] GenomicRanges_1.30.3                   
[10] GenomeInfoDb_1.14.0                    
[11] AnnotationDbi_1.40.0                   
[12] IRanges_2.12.0                         
[13] S4Vectors_0.16.0                       
[14] Biobase_2.38.0                         
[15] BiocGenerics_0.24.0                    

loaded via a namespace (and not attached):
 [1] SummarizedExperiment_1.8.1 pbdZMQ_0.3-2              
 [3] progress_1.1.2             locfit_1.5-9.1            
 [5] repr_0.12.0                lattice_0.20-35           
 [7] rtracklayer_1.38.3         blob_1.1.1                
 [9] XML_3.98-1.10              RBGL_1.54.0               
[11] DBI_0.8                    BiocParallel_1.12.0       
[13] bit64_0.9-7                uuid_0.1-2                
[15] matrixStats_0.53.1         GenomeInfoDbData_1.0.0    
[17] stringr_1.3.0              zlibbioc_1.24.0           
[19] Biostrings_2.46.0          memoise_1.1.0             
[21] evaluate_0.10.1            biomaRt_2.34.2            
[23] BiocInstaller_1.28.0       IRdisplay_0.4.4           
[25] Rcpp_0.12.16               IRkernel_0.8.12.9000      
[27] DelayedArray_0.4.1         graph_1.56.0              
[29] jsonlite_1.5               XVector_0.18.0            
[31] bit_1.1-12                 Rsamtools_1.30.0          
[33] RMySQL_0.10.14             digest_0.6.15             
[35] stringi_1.1.7              grid_3.4.4                
[37] tools_3.4.4                bitops_1.0-6              
[39] magrittr_1.5               RCurl_1.95-4.10           
[41] RSQLite_2.1.0              pkgconfig_2.0.1           
[43] crayon_1.3.4               Matrix_1.2-13             
[45] prettyunits_1.0.2          assertthat_0.2.0          
[47] httr_1.3.1                 R6_2.2.2                  
[49] GenomicAlignments_1.14.2   compiler_3.4.4 

 

edgeR cpm • 29k views
ADD COMMENT
4
Entering edit mode
@james-w-macdonald-5106
Last seen 17 hours ago
United States

The cpm function computes log2 counts/million, as stated in the help page for the function. But remember that you are computing counts/million counts. If you start with 0 counts, bump that up to 1 with a prior count of 1, and then divide by the total counts for that sample (in millions), unless you have less than a million reads for that sample, you should expect a fractional value. 

ADD COMMENT
5
Entering edit mode

To add to James' comment, see posts from Gordon about the scaling of the prior count within cpm():

Differences between limma voom E values and edgeR cpm values?

I should also mention that computing log2(CPM+1) is, in my opinion, a Bad Idea. Consider an example where I get a CPM of 0.1 in library A and a CPM of 0.2 in library B. This is clearly a 2-fold change between libraries, but after computing the log-CPMs with the above naive formula, I would get values of 0.13 and 0.26 in A and B respectively. Clearly, this result is wrong - on the log-scale, I should see a difference of 1 log unit between these two libraries.

"Well, that's fine," you might say, "because a CPM of 0.1 is pretty low, and the +1 will shrink it towards zero prior to log-transformation, thus stabilizing any log-fold changes between libraries in the presence of limited information." But is a CPM of 0.1 really that low? If my libraries contained 100 million reads, a CPM of 0.1 and 0.2 would correspond to counts of 10 and 20, respectively. That's more than enough information to tell me that I have a 2-fold change between A and B. Indeed, cpm() will respect this and report log-CPMs that do, in fact, differ by ~1 log unit:

cpm(cbind(10, 20), lib.size=rep(100e6, 2), prior.count=1, log=TRUE)
##           [,1]      [,2]
## [1,] -3.184425 -2.251539

In short, it makes more sense to apply the prior count on the scale of, well, the counts, rather than the CPMs.

ADD REPLY
0
Entering edit mode

actually log-CPM adds an offset of 1/L where 1 is the “prior count” and L is the average library size in millions. So the log-CPM values are related to the CPM values by log2(CPM + 1/L). L <- mean(x$samples$lib.size) * 1e-6

ADD REPLY
0
Entering edit mode

Close, but not quite. You cannot derive the edgeR log-CPMs from the naively computed CPM, because the library size is increased by twice the scaled prior count (see ?addPriorCount for more details).

ADD REPLY
2
Entering edit mode
ftroot1010 ▴ 20
@ftroot1010-16166
Last seen 5.8 years ago

Hi,

I had the same question, and I didn't find the the math definition of cpm function in edgeR. According to the details of the function, "If log-values are computed, then a small count, given by prior.count but scaled to be proportional to the library size, is added to y to avoid taking the log of zero." So I tried many possibilities and finally got something maybe right. The codes and detail are showed below.

I used TMM for normalization, while you can ignore that and keep normalization factors (x$samples$norm.factors) = 1. I keep that for answering the problem a little more generally.

> x = calcNormFactors(x, method='TMM') # Just skip this line if don't need to normalization by TMM

Calculate the cpm with two different parameters.

> prior.count = 1
> cpm_with_log2 = cpm(x, log=T, prior.count=prior.count)
> cpm_without_log2 = cpm(x, log=F)

Calculate the normalized library sizes.

> nlib = x$samples$lib.size * x$samples$norm.factors

So the cpm_without_log2 is same as

> cpm_without_log2_by_hand = t( t(x$counts) / nlib * 1e6 )
> cpm_without_log2[1:5, 1:5]; cpm_without_log2_by_hand[1:5, 1:5] # Take a look and compare

And I found that the cpm_with_log2 can be calculated from

> prior.count.scaled = prior.count * length(nlib) * nlib / sum(nlib)
> cpm_with_log2_by_hand = t(log2( (t(tempData$counts)+prior.count.scaled) / (nlib +  2*prior.count.scaled) * 1e+06 ))
> cpm_with_log2[1:5, 1:5]; cpm_with_log2_by_hand[1:5, 1:5] # Take a look and compare

By observing the different between cpm_without_log2_by_hand and cpm_with_log2_by_hand, we can transfer cpm_without_log2 to cpm_with_log2 by

> cpm_with_log2_two_step = t(log2( (t(cpm_without_log2/1e6) * nlib + prior.count.scaled) / (nlib+2*prior.count.scaled) *1e6 ))
> cpm_with_log2[1:5, 1:5]; cpm_with_log2_two_step [1:5, 1:5] # Take a look and compare

Notice that if you compare these two matrix with

> all( cpm_with_log2_two_step == cpm_with_log2 )

It must get FALSE. While I consider it as round-off error, because of the little bias:

> sum( abs(cpm_with_log2_two_step - cpm_with_log2 ))

 

While the calculation above is just my guess, so maybe it's wrong.

ADD COMMENT
0
Entering edit mode

Perhaps the documentation for addPriorCount may be informative regarding the prior count addition strategy.

ADD REPLY

Login before adding your answer.

Traffic: 602 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6