Question

edgeR cpm() function with and without log2

b.nota ▴ 370

Hello,

I have a question about the cpm function from edgeR. When I call it with log = TRUE, I get different results than when I call it with log = FALSE and apply a log2 transformation afterwards. What am I missing here?

Edit: Does this have to do with the scaling of the prior count? If so, what is the benefit of scaling it? Why is that better than just adding 0.5 to each read count?

> CPM <- cpm(DGE1, log = TRUE, prior.count = 0.5, normalized.lib.sizes = FALSE)
> tail(CPM)
                        DC07      DC08      DC09      DC10      DC11      DC12
ENSMUSG00000099399 -5.935507 -5.935507 -5.935507 -5.935507 -5.935507 -5.935507
ENSMUSG00000095134 -5.935507 -5.935507 -5.935507 -3.647512 -5.935507 -5.935507
ENSMUSG00000095366 -5.935507 -5.935507 -5.935507 -5.935507 -5.935507 -5.935507
ENSMUSG00000096768 -4.385629 -4.434476 -5.935507 -4.378766 -5.935507 -5.935507
ENSMUSG00000099871 -5.935507 -5.935507 -5.935507 -5.935507 -5.935507 -5.935507
ENSMUSG00000096850 -5.935507 -5.935507 -5.935507 -5.935507 -5.935507 -5.935507

> CPM_F <- cpm(DGE1, log = FALSE, normalized.lib.sizes = FALSE)
> tail(CPM_F)
                       DC07       DC08 DC09       DC10 DC11 DC12
ENSMUSG00000099399 0.000000 0.00000000    0 0.00000000    0    0
ENSMUSG00000095134 0.000000 0.00000000    0 0.06345822    0    0
ENSMUSG00000095366 0.000000 0.00000000    0 0.00000000    0    0
ENSMUSG00000096768 0.031501 0.02990833    0 0.03172911    0    0
ENSMUSG00000099871 0.000000 0.00000000    0 0.00000000    0    0
ENSMUSG00000096850 0.000000 0.00000000    0 0.00000000    0    0

> log2CPM <- log2(CPM_F + 0.5)
> tail(log2CPM)
                         DC07       DC08 DC09       DC10 DC11 DC12
ENSMUSG00000099399 -1.0000000 -1.0000000   -1 -1.0000000   -1   -1
ENSMUSG00000095134 -1.0000000 -1.0000000   -1 -0.8276194   -1   -1
ENSMUSG00000095366 -1.0000000 -1.0000000   -1 -1.0000000   -1   -1
ENSMUSG00000096768 -0.9118557 -0.9161853   -1 -0.9112366   -1   -1
ENSMUSG00000099871 -1.0000000 -1.0000000   -1 -1.0000000   -1   -1
ENSMUSG00000096850 -1.0000000 -1.0000000   -1 -1.0000000   -1   -1

R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] gplots_3.0.1 edgeR_3.20.9 limma_3.34.9

loaded via a namespace (and not attached):
 [1] compiler_3.4.3     Rcpp_0.12.15       KernSmooth_2.23-15 splines_3.4.3     
 [5] gdata_2.18.0       grid_3.4.3         locfit_1.5-9.1     caTools_1.17.1    
 [9] bitops_1.0-6       gtools_3.5.0       lattice_0.20-35

updated 7.9 years ago by Aaron Lun • written 7.9 years ago by b.nota

score 3 · Answer 1 · 2018-03-13

As you may have already noticed, it is because cpm adds a prior count to the counts for each library when log=TRUE. This avoids undefined values from counts of zero, and it also stabilizes the differences in log-expression values between libraries, i.e., it squeezes the log-fold changes towards zero, especially for low counts where there would be little evidence for large fold changes anyway.
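To make the shrinkage concrete, here is a minimal base-R sketch. The counts are hypothetical, the helper lfc is made up for illustration, and the two libraries are assumed to have equal sizes so that the library size cancels out of the log-fold change. It is not edgeR's code, only an illustration of the effect described above.

```r
prior.count <- 0.5  # same value as in the question

# Hypothetical counts for one gene in two libraries (A and B) of equal size,
# with a 4-fold difference at low and at high abundance.
low  <- c(A = 1,   B = 4)
high <- c(A = 100, B = 400)

log2(0)                # -Inf: a zero count is undefined on the log scale
log2(0 + prior.count)  # finite once a prior count is added

# Log-fold change of B over A, with an optional prior count added to both.
# (Hypothetical helper; library sizes are equal, so they cancel.)
lfc <- function(y, prior = 0) {
  unname(log2(y["B"] + prior) - log2(y["A"] + prior))
}

lfc.low.raw    <- lfc(low)                 # exactly 2
lfc.low.prior  <- lfc(low, prior.count)    # log2(4.5 / 1.5), noticeably shrunk
lfc.high.raw   <- lfc(high)                # exactly 2
lfc.high.prior <- lfc(high, prior.count)   # log2(400.5 / 100.5), barely changed
```

The same prior count removes a large share of the fold change at counts of 1 versus 4, but almost none at 100 versus 400, which is the "especially for low counts" behaviour.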

Scaling the prior count in proportion to the library size ensures that its relative effect is the same in each library, regardless of sequencing depth. Simply adding 0.5 to each count would effectively add a larger value, on the CPM scale, to counts in small libraries once you divide by the library size. This would result in spurious non-zero log-fold changes between libraries of different sizes; see "Differences between limma voom E values and edgeR cpm values?" for details.
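The following base-R sketch compares the two approaches on toy data. It assumes that the scaled prior is computed as lib.size / mean(lib.size) * prior.count and that twice this value is added to the library size, which is my understanding of what cpm does when log=TRUE; check the edgeR source for your version before relying on it. The count matrix and variable names are hypothetical.

```r
# Toy counts: 3 genes x 2 libraries whose depths differ 10-fold.
# Every gene has the same proportion in both libraries, so the true
# log-fold change is zero everywhere.
counts <- matrix(c(0, 10, 990,
                   0, 100, 9900),
                 nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), c("small", "large")))
lib.size <- colSums(counts)   # 1000 and 10000
prior.count <- 0.5

# Assumed scaling: prior proportional to library size, averaging to prior.count.
prior.scaled <- lib.size / mean(lib.size) * prior.count
adj.lib.size <- lib.size + 2 * prior.scaled

# log-CPM with the scaled prior (t() so that per-library vectors recycle correctly)
logcpm.scaled <- log2(t((t(counts) + prior.scaled) / adj.lib.size) * 1e6)

# Naive alternative: the same 0.5 added to every count
logcpm.naive <- log2(t((t(counts) + prior.count) / lib.size) * 1e6)

# Log-fold change between the two libraries for each gene
lfc.scaled <- logcpm.scaled[, "large"] - logcpm.scaled[, "small"]
lfc.naive  <- logcpm.naive[, "large"]  - logcpm.naive[, "small"]
```

With the scaled prior, lfc.scaled is zero for every gene (up to floating-point error), and a zero count maps to the same log-CPM in both libraries, as in the -5.935507 entries of the question. With the naive prior, lfc.naive is non-zero for g1 and g2 even though nothing differs between the libraries except depth.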