logCPM to CPM conversion after removeBatchEffect()
Entering edit mode
Last seen 9 weeks ago

Dear all,

I was wondering if it is possible to convert logCPM values to CPM values after removing the batch effects using removeBatchEffect since it only returns logCPM values.


As Aaron stated in this post, it is not a good idea to convert CPM to logCPM using log2(CPM + 1)

# y is a DGEList object
CPM <- cpm(y)
prior <- 1
logCPM <- log2(CPM + prior)

Instead, it is recommend to convert CPM to logCPM using

logCPM <- cpm(y, log = TRUE, prior.count = prior)

I assume that it is equally a bad idea to convert logCPM values to CPM by

CPM <- 2^logCPM

Since removeBatchEffect takes logCPMs and returns batch corrected logCPMs, I was wondering if there is a straightforward method to transform those logCPMs into linear scale.

Case scenario

Let's assume that we have some batches of cells and they are treated with a drug in a controlled setup (group). However, the experiments performed by 2 independent labs and MDS plots showed that there is a clear batch effect (batch).

# Matrix size
nsamples <- 12
ngenes <- 1e5

# Experimental setup
meta <- data.frame(group = factor(rep(c("Control","Treated"), each = 6)),
                   batch = factor(rep(c(1,2), 6))
meta <- meta[order(meta$batch),]
rownames(meta) <- paste0("Sample", 1:nsamples)

Here is how our experimental setups looks like:

           group batch
Sample1  Control     1
Sample2  Control     1
Sample3  Control     1
Sample4  Treated     1
Sample5  Treated     1
Sample6  Treated     1
Sample7  Control     2
Sample8  Control     2
Sample9  Control     2
Sample10 Treated     2
Sample11 Treated     2
Sample12 Treated     2

Let's create a DGEList object for our experimental setup and remove the batch effect:

# counts matrix
counts <- matrix(rnbinom(n = ngenes * nsamples, mu=10, size=10), ncol = nsamples)
rownames(counts) <- paste0("Gene", 1:ngenes)
colnames(counts) <- rownames(meta)

# DGEList
y <- DGEList(counts = counts, samples = meta)

# Calculate normalization factor
y <- calcNormFactors(y)

# Correct for batch effect
prior  <- 1
logCPM <- cpm(y, log = TRUE, prior.count = prior ) 
logCPM_noBatch <- removeBatchEffect(logCPM, batch = y$samples$batch) 

Potential solution

I came across ftroot1010's answer on this post and I thought it might be possible to calculate CPM from logCPM.

To be able to calculate CPM value, we need:

  1. Library sizes
  2. Normalization factors
  3. Prior

I wrote a function taking these parameters into account and converting logCPM to CPM:

# Function to convert logCPM to CPM
logcpm2cpm <- function(logcpm,        # logCPM matrix
                       dgelist,       # DGEList object 
                       prior = 0.25){ # Prior to add
  y <- dgelist
  nlib <- y$samples$lib.size * y$samples$norm.factors
  prior.scaled <- prior * length(nlib) * nlib / sum(nlib)
  pseudo.counts <- t((2^(t(logcpm)) * (nlib + 2*prior.scaled) / 1e+06) - prior.scaled)
  cpm <- t( t(pseudo.counts) / nlib * 1e+06 )

It looks like it is possible to convert logCPM values to CPM:

res1 <- cpm(y, prior)
res1[1:3, 1:4]

#         Sample1   Sample2   Sample3   Sample4
# Gene1 11.004037 11.004448  9.985968 15.988008
# Gene2  8.002936  8.003235  8.987371 10.991755
# Gene3  7.002569  9.003639 10.984565  9.992505

res2 <- logcpm2cpm(logCPM, y, prior)
res2[1:3, 1:4]

#         Sample1   Sample2   Sample3   Sample4
# Gene1 11.004037 11.004448  9.985968 15.988008
# Gene2  8.002936  8.003235  8.987371 10.991755
# Gene3  7.002569  9.003639 10.984565  9.992505

However, the values are not exactly identical. I expect that this is due to some rounding issues as stated by user ftroot1010, previously.

all(res1 == res2)

Question 1: Do you think if there is a better way of handling round-off errors?

Considering that we can calculate the CPMs if logCPM, DGEList object and prior are provided, I was considering:

CPM_noBatch <- logcpm2cpm(logCPM_noBatch, y, prior)
CPM_noBatch[1:3, 1:4]

#         Sample1   Sample2   Sample3  Sample4
# Gene1 10.737547 10.737949  9.742079 15.61087
# Gene2  9.584383  9.584734 10.741741 13.09821
# Gene3  8.095322 10.369634 12.621051 11.49353

Question 2: Do you think if this is a good approach? Instead, would you recommend just 2^logCPM_noBatch

CPM_noBatch2 <- 2^logCPM_noBatch
CPM_noBatch2[1:3, 1:4]

#         Sample1  Sample2  Sample3  Sample4
# Gene1 11.737662 11.73806 10.74220 16.61098
# Gene2 10.584500 10.58485 11.74186 14.09832
# Gene3  9.095442 11.36975 13.62116 12.49364

Thank you very much in advance for your time and insights.

edger removeBatchEffect cpm logCPM limma • 2.6k views
Entering edit mode

Why do you want to do this?

Entering edit mode

Hi Aaron,

Sorry for not stating "why". Some downstream analysis tools like DODR, CIBERSORT etc. accept only non-logarithmic values. Knowing that there is a clear batch effect in my dataset, I didn't want to run the analysis with either CPM or 2^logCPM_noBatch values.

Entering edit mode
Aaron Lun ★ 28k
Last seen 9 minutes ago
The city by the bay

There are two solutions. The first is to follow the instructions here. This will give you "batch corrected counts" for which you can compute CPMs as usual. However - do not be tempted to use these counts for any serious statistical analysis, for the same reason that you shouldn't use the output of removeBatchEffect() for linear modelling.

The second solution is relevant if your downstream methods operate on a per-sample basis, i.e., the calculations for one sample are independent of the calculations for another sample (ignoring PRNG seeding issues). Let's assume that CIBERSORT gives you independent per-sample estimates of the cell type proportions. If this is true, you can simply run the method on each sample, get per-sample proportions, and then include the batch effect in your analysis on the proportions. For example, if you're looking for differential proportions, just put the batch effect in the linear model that you're fitting to the proportion estimates. Much more direct than fiddling with the counts, and more correct as well, because now that downstream model has the opportunity to account for the uncertainty of the batch effect.

Entering edit mode

Hi Aaron,

Thank you very much for your answers. If I may, I have some follow-up questions:

Question 1: The solution 1 you have provided is a very elegant way of getting the pseudo-counts and I will be using that for CPM calculation if needed. Could you also help me to understand what I was missing on my attempt with logcpm2cpm function? I also calculate pseudo-counts there, but obviously it gives me different numbers than your approach.

Regarding to the estimateDisp part in your solution 1, would you use "common dispersion" (y$common.dispersion) or something else (y$trended.dispersion or y$tagwise.dispersion)?

# (...)
design.all <- model.matrix(~group + batch)
y <- estimateDisp(y, design = design.all)
fit <- glmFit(y, design.all, dispersion = y$common.dispersion)
# (...)

Question 2: Solution 2 proposes a really nice way of modeling the differential cell proportion using CIBERSOFT, thanks a lot for that, I will try it out. However, let's say that I am only interested in the "true" proportion of each cell type in each sample rather than differential proportions. Would it be ok to use batch corrected CPM values in that case?

Question 3: I am a bit puzzled about the use cases of removeBatchEffect() output. I know that you strongly recommend not to use it for any statistical analysis, but some tools do not allow any covariate or batch in the models. I have some case scenarios below and would definitely love to hear your opinion whether it is appropriate to use batch effect removed CPM/logCPMs or not.

The experimental setup for case 1-2 was shown in the original question:

  • Case 1: After initial observation of MDS plots, I observed a strong batch effect and I was wondering if I can see separation of control/treated cells on MDS plots if there were no batch effects.

  • Case 2: I have metabolic/physiological measurements (e.g. mitochondria concentration, nucleus size, respiration rate) for only Control cells and I would like to evaluate if there is any correlation between the metabolic/physiological measurements and the gene expression levels. To do that, I will use a non-parametric approach (e.g. Spearman correlation) and I will not be able to add any covariate as a batch effect.

# dat is a data.frame
> dat
##         Gene1_CPM_noBatch Mitochondria_concentration
## Sample1         10.999296                   5.064678
## Sample2         11.014054                   4.039562
## Sample3          9.994203                   5.049669
## Sample7          9.656054                   5.943472
## Sample8         10.690625                   4.539509
## Sample9          9.623157                   7.751637

> cor(dat[,1], dat[,2], method = "spearman")
## -0.8285714
  • Case 3: A majority of the tools for detecting the (circadian) rhythmicity on gene expression (e.g. rain, JTK_CYCLE) does not consider any underlying covariates. Let's say that we have time series data with batch effect and only input value is for the analysis tool is CPM values.
meta <- data.frame(cell = rep(gl(6,6),2),
                   time = rep(seq(4, 24, by = 4), 12),
                   group = factor(rep(c("Control","Treated"), each = 36)),
                   batch = factor(rep(rep(c(1,2), each = 18), 2))
meta <- meta[order(meta$batch),]

Should I use batch effect removed CPM/logCPM in any of these case scenarios? If not, what would you recommend?

Thank you so much for your insights in advance.

Entering edit mode

Question 1. There is no guarantee that the un-logged CPMs will be positive after subtracting the prior count. If you used removeBatchEffect, the corrected log-CPMs may well be below log2(prior.count).

Regarding the dispersion: I would use the tagwise dispersion, but it probably doesn't matter all that much.

Question 2. Why do you think that removing the batch effect in the counts will give you the true proportions? You have given no reason to think that the estimated proportions in one batch are more accurate than the estimated proportions in another batch. Let's say that you get 40% T cells consistently in one batch and 20% T cells in another batch. Which one is the true proportion? Batch correction isn't going to answer that for you, it just makes them all the same.

(Aside: I would only imagine this argument to be reasonable if, e.g., one batch was generated using a sequencing technology that is known to work with CIBERSORT, while the other batch was generated with another technology with unknown or worse performance. In that case, I could imagine getting rid of the batch effect in the counts of the second batch so as to improve CIBERSORT's accuracy. However, this requires care with the arguments of removeBatchEffect.)

Question 3.

Case 1 is an ad hoc visual examination, not a serious analysis.

Case 2 is correctly handled by fitting a linear model using the terms of interest as covariates. Your comment about non-parametric methods suggests you're worried about non-linear trends, in which case you should use splines.

Case 3 is the only scenario where the use of batch corrected values can be reluctantly accepted. If those methods are spitting out p-values, they will likely not hold their size. You will lose type I error control during inference, e.g., you think you have a FDR of 5% but your FDR might be anywhere from 10-20%. But hey, some result is better than nothing.

Entering edit mode

Hi Aaron,

I am grateful to you for your kind and detailed answers. Regarding to:

Question 1: You are definitely right. When I was simulating the counts, I have also come across to negative un-logged CPMs when the mu and size parameters for rnbinom are low.

nsamples <- 12
ngenes <- 1e5

counts <- matrix(rnbinom(n = ngenes * nsamples, mu=3, size=3), ncol = nsamples)
rownames(counts)  <- paste0("Sample", 1:nsamples)

y <- DGEList(counts = counts)
y <- calcNormFactors(y)

prior  <- 1
CPM <- cpm(y, prior.count = prior) 
#    Sample1   Sample2  Sample3   Sample4
# 1 9.963040  0.000000 10.09907 13.296543
# 2 6.642027 16.593693  0.00000  6.648271
# 3 6.642027  3.318739  6.73271  6.648271

logCPM <- cpm(y, log = TRUE, prior.count = prior) 
CPM_manual <- logcpm2cpm(logCPM, y, prior)

#    Sample1      Sample2       Sample3   Sample4
# 1 9.963040 1.473816e-15  1.009907e+01 13.296543
# 2 6.642027 1.659369e+01 -4.484886e-15  6.648271
# 3 6.642027 3.318739e+00  6.732710e+00  6.648271

And thanks a lot for your suggestion on the dispersion type to use.

Question 2: I agree. The batch effect we are observing might be even from the different compositions on the cell types. It is better to keep it as it is for the deconvolution.

Question 3: - Case 2: Yes, I was worried about the non-linear trends in the data. Instead of using a predictive model, I am more interested whether there are any significant associations. As you have mentioned several times to avoid batch corrected CPMs in any serious statistical analysis, I try to avoid them as much as possible. Apparently, PResiduals considers covariates in Spearman correlation, so I will use that when I observe any batch effects.

Regarding to splines, I believe it requires a lot of optimization depending on the non-linear trend and noise in the data as discussed in details here.

And lastly, thank you so much for your time and help. I learned a lot from this conversation.

Entering edit mode

Instead of using a predictive model, I am more interested whether there are any significant associations.

Two sides of the same coin, really. Noise in the covariate just manifests and is modeled as increased variability in the response when you're fitting a linear model.

Apparently, PResiduals considers covariates in Spearman correlation, so I will use that when I observe any batch effects.

I don't have any experience with that package, so you're on your own there. But expect rank-based approaches to be underpowered when you don't have many replicates. You have 9 degrees of freedom in the bare minimum model; that makes it difficult to get a low p-value from a rank-based test. (Or even the accuracy of the test, if it involves asymptotic approximations.) And I'm not even considering problems with exchangeability of libraries, as discussed here.

Regarding to splines, I believe it requires a lot of optimization depending on the non-linear trend and noise in the data

Not really. Care is required if you want to use the actual fitted values of the curve for something, but in your case, you just want to know if there is a significant effect. In the latter case, the analysis is more forgiving with respect to the choice of df; too many df just results in decrease in power rather than invalid results.


Login before adding your answer.

Traffic: 629 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6