Question: How to use removeBatchEffect for removing effect of multiple confounding variables
gravatar for ompandey
2.8 years ago by
U.S.A/New York/Icahn School of Medicine
ompandey0 wrote:

Hello Everyone,

I am new to the limma based batch correction functionalities. I am trying to use removeBatchEffect() function from limma (3.24.6) package. I have three confounding variables, for which I want to remove their effect on the gene expression. While removing the effect of confounding variables, I want to preserve the information about the treatment in the sample. I have used DESeq2 to variance stabilizing transformation of the gene expression, stored in the vst_df dataframe of the code snippet below. I have following question:

Q: I am using first and second confounding variables as batch and batch2 information in the removeBatchEffect function. The third confounding variable, I am using as a covariates matrix. Whether the way I am using removeBatchEffect is correct or not?

>  head(covariates_df,8) # A,B and C is age,gender and race respectively
     TREAT  A    B    C
S_1      1  24   2    1
S_2      1  30   1    1
S_3      1  26   2    1
S_4      1  25   1    1
S_5      2  30   2    1
S_6      1  29   1    1
S_7      1  28   1    1
S_8      1  22   1    1

> str(covariates_df)
'data.frame':    150 obs. of  4 variables:
$ TREAT: Factor w/ 2 levels "1","2": 1 1 1 1 2 1 1 2 2 1 ...
$ A    : num  24 30 26 25 30 29 28 28 28 22 ...
$ B    : Factor w/ 2 levels "1","2": 2 1 2 1 2 1 1 2 2 1 ...
$ C    : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 2 1 ...

> dim(vst_df) # gene expression in columns and samples in row

[1]   150 11622

> tapply(vst_df$XIST, covariates_df$B, mean)
        1         2 
 5.837229 10.764095

> design <- model.matrix(~TREAT,data=covariates_df)

> design
            (Intercept) TREAT2
S_1             1       0
S_2             1       0
S_3             1       0
S_4             1       0
S_149           1       1
S_150           1       1
[1] 0 1
[1] "contr.treatment"

> batch1 <- as.factor(covariates_df$B)
> batch2 <- as.factor(covariates_df$C)

> corrected_vst_df <- removeBatchEffect(t(vst_df),batch=batch1,batch2=batch2,covariates=covariates_df[,c('A')],design=design)

> dim(corrected_vst)
[1] 11622   150

> tapply(vst_new$XIST, vst_new$B, mean)
       1        2 
7.722130 7.880831 

> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: CentOS release 6.2 (Final)
[1] C
attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base    
other attached packages:
[1] limma_3.24.6              BiocParallel_1.2.2      
[3] dplyr_0.4.1               DESeq2_1.8.1            
[5] RcppArmadillo_0. Rcpp_0.11.6             
[7] GenomicRanges_1.20.4      GenomeInfoDb_1.4.0      
[9] IRanges_2.2.3             S4Vectors_0.6.0         
[11] BiocGenerics_0.14.0       data.table_1.9.4         
limma deseq2 removebatcheffect • 1.3k views
ADD COMMENTlink modified 2.8 years ago by Ryan C. Thompson7.2k • written 2.8 years ago by ompandey0
Answer: How to use removeBatchEffect for removing effect of multiple confounding variabl
gravatar for Ryan C. Thompson
2.8 years ago by
The Scripps Research Institute, La Jolla, CA
Ryan C. Thompson7.2k wrote:

The batch1 and batch2 arguments are really just for convenience. Everything gets combined into the covariates matrix before doing the math. In the most general case, the way to run removeBatchEffect is to create a single design matrix with all your covariates, both treatment and batches, and then split it into two matrices, one containing the columns corresponding to your covariates of interest (and the intercept, if any), and the other corresponding to the batch effects. In your case, it would be (assuming you have exactly 2 treatments):

design <- model.matrix(~TREAT + A + B + C, data=covariates_df) <- design[,1:2] <- design[,-(1:2)]
corrected_vst_df <- removeBatchEffect(t(vst_df),,

One thing to keep in mind when doing this is that the batch effect will be removed by adjusting everything to match the base level of each factor and the zero point of each numeric covariate. So every sample will be adjusted to age zero (by extrapolating linearly), and then, if female and Asian are the first levels of the gender and race factor, then the male samples will be adjusted to have the same mean as the females, and every other race will be adjusted to have the same mean as the Asian samples. Obviously this will change the overall mean expression of each gene. I'm pretty sure you can keep removeBatchEffect from changing the overall means by first subtracting the column mean from each column of the batch design (see the scale function).

ADD COMMENTlink modified 2.8 years ago • written 2.8 years ago by Ryan C. Thompson7.2k

Also check if your confounders are orthogonal to your treatment effect. For example, if most treated individuals were male, and most untreated individuals were female, there's clearly a dependency between treatment and sex, so removal of the (uninteresting) sex effect may also remove some of the (interesting) treatment effect as well. Setting design as you've done provides some protection, but it's easy to imagine cases where loss of interesting signal still happens, especially in noisy data.

Anyway, whether or not the confounders are orthogonal or not depends on the experimental design, so there's nothing much you can do about it in the analysis. Still, it's good to check for peace of mind; hopefully you don't have to worry about it.

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by Aaron Lun23k

Thanks Aaron for your quick reply. 

I have checked that at-least two of my confounders for being orthogonal to treatment effect or not.

> tapply(as.numeric(covariates_df$TREAT),covariates_df$B, table)
$`1`  # male
1  2  # 1 as treated and 2 as untreated
32 29
$`2`  # female
1  2  # 1 as treated and 2 as untreated
65 24

However, for three out of four levels of race, there is little or no treated information 

> tapply(as.numeric(covariates_df$TREAT), covariates_df$C, table)
1  2 # 1 as treated and 2 as untreated
96 21
1  2
1 23

Do you think, I should worry too much about the confounders orthogonality to treatment? 

ADD REPLYlink written 2.8 years ago by ompandey0

I would say that confounding between race and treatment is a potential issue, but as Aaron says, there's nothing you can do to fix that now that the experiment has already been done. Also, why are you using tapply for this? Just do table(covariates_df$TREAT, covariates_df$B) and you'll get a much clearer presentation of the data.

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by Ryan C. Thompson7.2k

Thanks Ryan for your quick reply.

Your suggestions are really helpful. I am not sure, if I got you regarding exactly how to prevent removeBatchEffect() from changing the overall mean expression of each gene. 

ADD REPLYlink written 2.8 years ago by ompandey0

Basically, just run scale() on the before running removeBatchEffect, and the returned expression matrix should (I think) have identical row means to the input matrix (of course, you can easily check this).

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by Ryan C. Thompson7.2k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 190 users visited in the last hour