Hello,
I am knew to R and RNA-seq analysis and I am trying to understand how the cpm function in the edgeR package calculates log2(cpm).
I have a count matrix in a DGEList object and I calculated the counts per million (CPM) and log2(CPM) as follow:
> CPM <- cpm(x)
> logCPM <- cpm(x, log=TRUE, prior.count = 1)
The CPM matrix looks as expected but the logCPM returns negative values, and with a prior.count=1.0 I would expect only positive values.
Indeed if I compute log2(CPM+1), I get completely different values. So what is cpm(x, log=TRUE) returning?
-------------------------------------------------------------- output of sessionInfo() --------------------------------------------------------
> sessionInfo()
R version 3.4.4 (2018-03-15) Platform: x86_64-apple-darwin15.6.0 (64-bit) Running under: macOS Sierra 10.12.6 Matrix products: default BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib locale: [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] edgeR_3.20.9 [2] limma_3.34.9 [3] Homo.sapiens_1.3.1 [4] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 [5] org.Hs.eg.db_3.5.0 [6] GO.db_3.5.0 [7] OrganismDbi_1.20.0 [8] GenomicFeatures_1.30.3 [9] GenomicRanges_1.30.3 [10] GenomeInfoDb_1.14.0 [11] AnnotationDbi_1.40.0 [12] IRanges_2.12.0 [13] S4Vectors_0.16.0 [14] Biobase_2.38.0 [15] BiocGenerics_0.24.0 loaded via a namespace (and not attached): [1] SummarizedExperiment_1.8.1 pbdZMQ_0.3-2 [3] progress_1.1.2 locfit_1.5-9.1 [5] repr_0.12.0 lattice_0.20-35 [7] rtracklayer_1.38.3 blob_1.1.1 [9] XML_3.98-1.10 RBGL_1.54.0 [11] DBI_0.8 BiocParallel_1.12.0 [13] bit64_0.9-7 uuid_0.1-2 [15] matrixStats_0.53.1 GenomeInfoDbData_1.0.0 [17] stringr_1.3.0 zlibbioc_1.24.0 [19] Biostrings_2.46.0 memoise_1.1.0 [21] evaluate_0.10.1 biomaRt_2.34.2 [23] BiocInstaller_1.28.0 IRdisplay_0.4.4 [25] Rcpp_0.12.16 IRkernel_0.8.12.9000 [27] DelayedArray_0.4.1 graph_1.56.0 [29] jsonlite_1.5 XVector_0.18.0 [31] bit_1.1-12 Rsamtools_1.30.0 [33] RMySQL_0.10.14 digest_0.6.15 [35] stringi_1.1.7 grid_3.4.4 [37] tools_3.4.4 bitops_1.0-6 [39] magrittr_1.5 RCurl_1.95-4.10 [41] RSQLite_2.1.0 pkgconfig_2.0.1 [43] crayon_1.3.4 Matrix_1.2-13 [45] prettyunits_1.0.2 assertthat_0.2.0 [47] httr_1.3.1 R6_2.2.2 [49] GenomicAlignments_1.14.2 compiler_3.4.4
To add to James' comment, see posts from Gordon about the scaling of the prior count within
cpm()
:Differences between limma voom E values and edgeR cpm values?
I should also mention that computing
log2(CPM+1)
is, in my opinion, a Bad Idea. Consider an example where I get a CPM of 0.1 in library A and a CPM of 0.2 in library B. This is clearly a 2-fold change between libraries, but after computing the log-CPMs with the above naive formula, I would get values of 0.13 and 0.26 in A and B respectively. Clearly, this result is wrong - on the log-scale, I should see a difference of 1 log unit between these two libraries."Well, that's fine," you might say, "because a CPM of 0.1 is pretty low, and the
+1
will shrink it towards zero prior to log-transformation, thus stabilizing any log-fold changes between libraries in the presence of limited information." But is a CPM of 0.1 really that low? If my libraries contained 100 million reads, a CPM of 0.1 and 0.2 would correspond to counts of 10 and 20, respectively. That's more than enough information to tell me that I have a 2-fold change between A and B. Indeed,cpm()
will respect this and report log-CPMs that do, in fact, differ by ~1 log unit:In short, it makes more sense to apply the prior count on the scale of, well, the counts, rather than the CPMs.
actually log-CPM adds an offset of 1/L where 1 is the “prior count” and L is the average library size in millions. So the log-CPM values are related to the CPM values by log2(CPM + 1/L). L <- mean(x$samples$lib.size) * 1e-6
Close, but not quite. You cannot derive the edgeR log-CPMs from the naively computed CPM, because the library size is increased by twice the scaled prior count (see
?addPriorCount
for more details).