Question

How to use removeBatchEffect for removing effect of multiple confounding variables

0

Entering edit mode

ompandey ▴ 10

@ompandey-10724

Last seen 6.6 years ago

U.S.A/New York/Icahn School of Medicine

Hello Everyone,

I am new to the limma based batch correction functionalities. I am trying to use removeBatchEffect() function from limma (3.24.6) package. I have three confounding variables, for which I want to remove their effect on the gene expression. While removing the effect of confounding variables, I want to preserve the information about the treatment in the sample. I have used DESeq2 to variance stabilizing transformation of the gene expression, stored in the vst_df dataframe of the code snippet below. I have following question:

Q: I am using first and second confounding variables as batch and batch2 information in the removeBatchEffect function. The third confounding variable, I am using as a covariates matrix. Whether the way I am using removeBatchEffect is correct or not?

>  head(covariates_df,8) # A,B and C is age,gender and race respectively
     TREAT  A    B    C
S_1      1  24   2    1
S_2      1  30   1    1
S_3      1  26   2    1
S_4      1  25   1    1
S_5      2  30   2    1
S_6      1  29   1    1
S_7      1  28   1    1
S_8      1  22   1    1



> str(covariates_df)
'data.frame':    150 obs. of  4 variables:
$ TREAT: Factor w/ 2 levels "1","2": 1 1 1 1 2 1 1 2 2 1 ...
$ A    : num  24 30 26 25 30 29 28 28 28 22 ...
$ B    : Factor w/ 2 levels "1","2": 2 1 2 1 2 1 1 2 2 1 ...
$ C    : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 2 1 ...

> dim(vst_df) # gene expression in columns and samples in row

[1]   150 11622

> tapply(vst_df$XIST, covariates_df$B, mean)
        1         2 
 5.837229 10.764095

> design <- model.matrix(~TREAT,data=covariates_df)

> design
            (Intercept) TREAT2
S_1             1       0
S_2             1       0
S_3             1       0
S_4             1       0
#.......
S_149           1       1
S_150           1       1
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$TREAT
[1] "contr.treatment"

> batch1 <- as.factor(covariates_df$B)
> batch2 <- as.factor(covariates_df$C)

> corrected_vst_df <- removeBatchEffect(t(vst_df),batch=batch1,batch2=batch2,covariates=covariates_df[,c('A')],design=design)

> dim(corrected_vst)
[1] 11622   150

> tapply(vst_new$XIST, vst_new$B, mean)
       1        2 
7.722130 7.880831 

> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: CentOS release 6.2 (Final)
locale:
[1] C
attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base    
other attached packages:
[1] limma_3.24.6              BiocParallel_1.2.2      
[3] dplyr_0.4.1               DESeq2_1.8.1            
[5] RcppArmadillo_0.5.200.1.0 Rcpp_0.11.6             
[7] GenomicRanges_1.20.4      GenomeInfoDb_1.4.0      
[9] IRanges_2.2.3             S4Vectors_0.6.0         
[11] BiocGenerics_0.14.0       data.table_1.9.4

limma removebatcheffect deseq2 • 5.9k views

ADD COMMENT • link updated 7.9 years ago by Ryan C. Thompson ★ 7.9k • written 7.9 years ago by ompandey ▴ 10

score 11 · Answer 1 · 2016-06-01

11

Entering edit mode

Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 8 months ago

Scripps Research, La Jolla, CA

The batch1 and batch2 arguments are really just for convenience. Everything gets combined into the covariates matrix before doing the math. In the most general case, the way to run removeBatchEffect is to create a single design matrix with all your covariates, both treatment and batches, and then split it into two matrices, one containing the columns corresponding to your covariates of interest (and the intercept, if any), and the other corresponding to the batch effects. In your case, it would be (assuming you have exactly 2 treatments):

design <- model.matrix(~TREAT + A + B + C, data=covariates_df)
treatment.design <- design[,1:2]
batch.design <- design[,-(1:2)]
corrected_vst_df <- removeBatchEffect(t(vst_df),design=treatment.design,covariates=batch.design)

One thing to keep in mind when doing this is that the batch effect will be removed by adjusting everything to match the base level of each factor and the zero point of each numeric covariate. So every sample will be adjusted to age zero (by extrapolating linearly), and then, if female and Asian are the first levels of the gender and race factor, then the male samples will be adjusted to have the same mean as the females, and every other race will be adjusted to have the same mean as the Asian samples. Obviously this will change the overall mean expression of each gene. I'm pretty sure you can keep removeBatchEffect from changing the overall means by first subtracting the column mean from each column of the batch design (see the scale function).

ADD COMMENT • link 7.9 years ago Ryan C. Thompson ★ 7.9k

2

Entering edit mode

Also check if your confounders are orthogonal to your treatment effect. For example, if most treated individuals were male, and most untreated individuals were female, there's clearly a dependency between treatment and sex, so removal of the (uninteresting) sex effect may also remove some of the (interesting) treatment effect as well. Setting design as you've done provides some protection, but it's easy to imagine cases where loss of interesting signal still happens, especially in noisy data.

Anyway, whether or not the confounders are orthogonal or not depends on the experimental design, so there's nothing much you can do about it in the analysis. Still, it's good to check for peace of mind; hopefully you don't have to worry about it.

ADD REPLY • link 7.9 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thanks Aaron for your quick reply.

I have checked that at-least two of my confounders for being orthogonal to treatment effect or not.

> tapply(as.numeric(covariates_df$TREAT),covariates_df$B, table)
$`1`  # male
1  2  # 1 as treated and 2 as untreated
32 29
$`2`  # female
1  2  # 1 as treated and 2 as untreated
65 24

However, for three out of four levels of race, there is little or no treated information

> tapply(as.numeric(covariates_df$TREAT), covariates_df$C, table)
$`1`
1  2 # 1 as treated and 2 as untreated
96 21
$`2`
1  2
1 23
$`3`
2
5
$`4`
2 
4

Do you think, I should worry too much about the confounders orthogonality to treatment?

ADD REPLY • link 7.9 years ago ompandey ▴ 10

1

Entering edit mode

I would say that confounding between race and treatment is a potential issue, but as Aaron says, there's nothing you can do to fix that now that the experiment has already been done. Also, why are you using tapply for this? Just do table(covariates_df$TREAT, covariates_df$B) and you'll get a much clearer presentation of the data.

ADD REPLY • link 7.9 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Thanks Ryan for your quick reply.

Your suggestions are really helpful. I am not sure, if I got you regarding exactly how to prevent removeBatchEffect() from changing the overall mean expression of each gene.

ADD REPLY • link 7.9 years ago ompandey ▴ 10

3

Entering edit mode

Basically, just run scale() on the batch.design before running removeBatchEffect, and the returned expression matrix should (I think) have identical row means to the input matrix (of course, you can easily check this).

ADD REPLY • link 7.9 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

I believe you have a small typo, when you say

In your case, it would be (assuming you have exactly 2 treatments):

I'm not sure what you mean by "having exactly two treatments", but the above example is when you have one treatment, 3 batches, no?

ADD REPLY • link 3.6 years ago Sam ▴ 10