Hello Everyone,
I am new to the limma based batch correction functionalities. I am trying to use removeBatchEffect() function from limma (3.24.6) package. I have three confounding variables, for which I want to remove their effect on the gene expression. While removing the effect of confounding variables, I want to preserve the information about the treatment in the sample. I have used DESeq2 to variance stabilizing transformation of the gene expression, stored in the vst_df dataframe of the code snippet below. I have following question:
Q: I am using first and second confounding variables as batch and batch2 information in the removeBatchEffect function. The third confounding variable, I am using as a covariates matrix. Whether the way I am using removeBatchEffect is correct or not?
> head(covariates_df,8) # A,B and C is age,gender and race respectively TREAT A B C S_1 1 24 2 1 S_2 1 30 1 1 S_3 1 26 2 1 S_4 1 25 1 1 S_5 2 30 2 1 S_6 1 29 1 1 S_7 1 28 1 1 S_8 1 22 1 1 > str(covariates_df) 'data.frame': 150 obs. of 4 variables: $ TREAT: Factor w/ 2 levels "1","2": 1 1 1 1 2 1 1 2 2 1 ... $ A : num 24 30 26 25 30 29 28 28 28 22 ... $ B : Factor w/ 2 levels "1","2": 2 1 2 1 2 1 1 2 2 1 ... $ C : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 2 1 ... > dim(vst_df) # gene expression in columns and samples in row [1] 150 11622 > tapply(vst_df$XIST, covariates_df$B, mean) 1 2 5.837229 10.764095 > design <- model.matrix(~TREAT,data=covariates_df) > design (Intercept) TREAT2 S_1 1 0 S_2 1 0 S_3 1 0 S_4 1 0 #....... S_149 1 1 S_150 1 1 attr(,"assign") [1] 0 1 attr(,"contrasts") attr(,"contrasts")$TREAT [1] "contr.treatment" > batch1 <- as.factor(covariates_df$B) > batch2 <- as.factor(covariates_df$C) > corrected_vst_df <- removeBatchEffect(t(vst_df),batch=batch1,batch2=batch2,covariates=covariates_df[,c('A')],design=design) > dim(corrected_vst) [1] 11622 150 > tapply(vst_new$XIST, vst_new$B, mean) 1 2 7.722130 7.880831 > sessionInfo() R version 3.2.0 (2015-04-16) Platform: x86_64-unknown-linux-gnu (64-bit) Running under: CentOS release 6.2 (Final) locale: [1] C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] limma_3.24.6 BiocParallel_1.2.2 [3] dplyr_0.4.1 DESeq2_1.8.1 [5] RcppArmadillo_0.5.200.1.0 Rcpp_0.11.6 [7] GenomicRanges_1.20.4 GenomeInfoDb_1.4.0 [9] IRanges_2.2.3 S4Vectors_0.6.0 [11] BiocGenerics_0.14.0 data.table_1.9.4
Also check if your confounders are orthogonal to your treatment effect. For example, if most treated individuals were male, and most untreated individuals were female, there's clearly a dependency between treatment and sex, so removal of the (uninteresting) sex effect may also remove some of the (interesting) treatment effect as well. Setting
design
as you've done provides some protection, but it's easy to imagine cases where loss of interesting signal still happens, especially in noisy data.Anyway, whether or not the confounders are orthogonal or not depends on the experimental design, so there's nothing much you can do about it in the analysis. Still, it's good to check for peace of mind; hopefully you don't have to worry about it.
Thanks Aaron for your quick reply.
I have checked that at-least two of my confounders for being orthogonal to treatment effect or not.
However, for three out of four levels of race, there is little or no treated information
Do you think, I should worry too much about the confounders orthogonality to treatment?
I would say that confounding between race and treatment is a potential issue, but as Aaron says, there's nothing you can do to fix that now that the experiment has already been done. Also, why are you using
tapply
for this? Just dotable(covariates_df$TREAT, covariates_df$B)
and you'll get a much clearer presentation of the data.Thanks Ryan for your quick reply.
Your suggestions are really helpful. I am not sure, if I got you regarding exactly how to prevent
removeBatchEffect()
from changing the overall mean expression of each gene.Basically, just run
scale()
on thebatch.design
before runningremoveBatchEffect
, and the returned expression matrix should (I think) have identical row means to the input matrix (of course, you can easily check this).I believe you have a small typo, when you say
I'm not sure what you mean by "having exactly two treatments", but the above example is when you have one treatment, 3 batches, no?