Dear all,
I was wondering if it is possible to convert logCPM values to CPM values after removing the batch effects using removeBatchEffect
since it only returns logCPM values.
Problem
As Aaron stated in this post, it is not a good idea to convert CPM to logCPM using log2(CPM + 1)
# y is a DGEList object
CPM <- cpm(y)
prior <- 1
logCPM <- log2(CPM + prior)
Instead, it is recommend to convert CPM to logCPM using
logCPM <- cpm(y, log = TRUE, prior.count = prior)
I assume that it is equally a bad idea to convert logCPM values to CPM by
CPM <- 2^logCPM
Since removeBatchEffect
takes logCPMs and returns batch corrected logCPMs, I was wondering if there is a straightforward method to transform those logCPMs into linear scale.
Case scenario
Let's assume that we have some batches of cells and they are treated with a drug in a controlled setup (group). However, the experiments performed by 2 independent labs and MDS plots showed that there is a clear batch effect (batch).
# Matrix size
nsamples <- 12
ngenes <- 1e5
# Experimental setup
meta <- data.frame(group = factor(rep(c("Control","Treated"), each = 6)),
batch = factor(rep(c(1,2), 6))
)
meta <- meta[order(meta$batch),]
rownames(meta) <- paste0("Sample", 1:nsamples)
Here is how our experimental setups looks like:
group batch
Sample1 Control 1
Sample2 Control 1
Sample3 Control 1
Sample4 Treated 1
Sample5 Treated 1
Sample6 Treated 1
Sample7 Control 2
Sample8 Control 2
Sample9 Control 2
Sample10 Treated 2
Sample11 Treated 2
Sample12 Treated 2
Let's create a DGEList
object for our experimental setup and remove the batch effect:
# counts matrix
set.seed(1)
counts <- matrix(rnbinom(n = ngenes * nsamples, mu=10, size=10), ncol = nsamples)
rownames(counts) <- paste0("Gene", 1:ngenes)
colnames(counts) <- rownames(meta)
# DGEList
y <- DGEList(counts = counts, samples = meta)
# Calculate normalization factor
y <- calcNormFactors(y)
# Correct for batch effect
prior <- 1
logCPM <- cpm(y, log = TRUE, prior.count = prior )
logCPM_noBatch <- removeBatchEffect(logCPM, batch = y$samples$batch)
Potential solution
I came across ftroot1010's answer on this post and I thought it might be possible to calculate CPM from logCPM.
To be able to calculate CPM value, we need:
- Library sizes
- Normalization factors
- Prior
I wrote a function taking these parameters into account and converting logCPM to CPM:
# Function to convert logCPM to CPM
logcpm2cpm <- function(logcpm, # logCPM matrix
dgelist, # DGEList object
prior = 0.25){ # Prior to add
y <- dgelist
nlib <- y$samples$lib.size * y$samples$norm.factors
prior.scaled <- prior * length(nlib) * nlib / sum(nlib)
pseudo.counts <- t((2^(t(logcpm)) * (nlib + 2*prior.scaled) / 1e+06) - prior.scaled)
cpm <- t( t(pseudo.counts) / nlib * 1e+06 )
return(cpm)
}
It looks like it is possible to convert logCPM values to CPM:
res1 <- cpm(y, prior)
res1[1:3, 1:4]
# Sample1 Sample2 Sample3 Sample4
# Gene1 11.004037 11.004448 9.985968 15.988008
# Gene2 8.002936 8.003235 8.987371 10.991755
# Gene3 7.002569 9.003639 10.984565 9.992505
res2 <- logcpm2cpm(logCPM, y, prior)
res2[1:3, 1:4]
# Sample1 Sample2 Sample3 Sample4
# Gene1 11.004037 11.004448 9.985968 15.988008
# Gene2 8.002936 8.003235 8.987371 10.991755
# Gene3 7.002569 9.003639 10.984565 9.992505
However, the values are not exactly identical. I expect that this is due to some rounding issues as stated by user ftroot1010, previously.
all(res1 == res2)
# FALSE
Question 1: Do you think if there is a better way of handling round-off errors?
Considering that we can calculate the CPMs if logCPM, DGEList object and prior are provided, I was considering:
CPM_noBatch <- logcpm2cpm(logCPM_noBatch, y, prior)
CPM_noBatch[1:3, 1:4]
# Sample1 Sample2 Sample3 Sample4
# Gene1 10.737547 10.737949 9.742079 15.61087
# Gene2 9.584383 9.584734 10.741741 13.09821
# Gene3 8.095322 10.369634 12.621051 11.49353
Question 2: Do you think if this is a good approach? Instead, would you recommend just 2^logCPM_noBatch
CPM_noBatch2 <- 2^logCPM_noBatch
CPM_noBatch2[1:3, 1:4]
# Sample1 Sample2 Sample3 Sample4
# Gene1 11.737662 11.73806 10.74220 16.61098
# Gene2 10.584500 10.58485 11.74186 14.09832
# Gene3 9.095442 11.36975 13.62116 12.49364
Thank you very much in advance for your time and insights.
Why do you want to do this?
Hi Aaron,
Sorry for not stating "why". Some downstream analysis tools like DODR, CIBERSORT etc. accept only non-logarithmic values. Knowing that there is a clear batch effect in my dataset, I didn't want to run the analysis with either
CPM
or2^logCPM_noBatch
values.