Question

removing the batch effect for vst counts and defining the batches in removeBatchEffect function

0

Entering edit mode

Sara • 0

@95b4edca

Last seen 13 hours ago

Belgium

Hi,

I am trying to remove the batch effect of my confounders.

here is the DESeq2 mqtrix design that I used:

dds <- DESeqDataSetFromHTSeqCount(sampleTable=sampleTable,
                                  directory=folder,
                                  design=~Plate+RIN+Sex+Age+condition+PC3+PC2+PC1)

condition is the comparison of control individuals with patients.

dds <- estimateSizeFactors(dds)
keep <- rowSums( counts(dds) >= 10 ) >= 20 #filters out genes that have less than 20 samples with counts of 10
dds <- dds[keep,]
colData(dds)$condition <- relevel(colData(dds)$condition, ref = "Control")
dds<- DESeq(dds)

vsd <- vst(dds, blind = FALSE)

I noticed variance normalized transformation in DESeq2 are not corrected for variables in my design formula.

So, then I found I could use removeBatchEffect if I am right.

I want to correct my counts for Sex, Age, RIN, PC1, PC2, and PC3 (I want to remove the effect of them). My purpose is to correct my normalized counts for these confounders to use as input for WGCNA.

I used the following lines from the DESeq2 manual:

mat <- assay(vsd)
mm <- model.matrix(~condition, colData(vsd))

mat <- limma::removeBatchEffect(mat, batch=vsd$batch, design=mm)
assay(vsd) <- mat

plotPCA(vsd)

Did I define the mm right? (The condition is my control and disease group which I want to avoid removing variation associated with it. Plate, RIN, Sex, Age, PC1, PC2,PC3 are the ones that I want to correct my counts for them)

. What do I have to define in mm <- model.matrix()? the ones that I want to correct for them or the ones that I want to avoid removing variation associated with it (my condition control vs patients)?

Also, here in mat how can I define the batch=vsd$

mat <- limma::removeBatchEffect(mat, batch=vsd$batch, design=mm)

Should I write it like this:

mat <- limma::removeBatchEffect(mat, batch=vsd$Age, batch2=vsd$Sex, covariates1= vsd$PC1, covariates2=vsd$PC2 , covariates3=.., covariates4=...,  design=mm) 

#  batch argument should be a categorical factor (like sex) and the covariates argument is for numerical covariates (like age).

But, still in the case if I used the limma::removeBatchEffect in the right way, the counts that I am getting from mat (assay(vsd) <- mat) are really similar to the assay(vsd) counts before using the limma::removeBatchEffect

I would appreciate to help me please understand how to use it and how to correct my data for the confounders that I have.

Thank you in advance!

sessionInfo( )
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Brussels
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] GOSemSim_2.26.1             DOSE_3.26.1                 devtools_2.4.5              usethis_2.2.2              
 [5] knitr_1.43                  lubridate_1.9.2             forcats_1.0.0               stringr_1.5.0              
 [9] dplyr_1.1.2                 purrr_1.0.1                 readr_2.1.4                 tidyr_1.3.0                
[13] tibble_3.2.1                tidyverse_2.0.0             EnhancedVolcano_1.18.0      biomaRt_2.56.1             
[17] DESeq2_1.40.2               SummarizedExperiment_1.30.2 MatrixGenerics_1.12.2       matrixStats_1.0.0          
[21] GenomicRanges_1.52.0        GenomeInfoDb_1.36.1         GO.db_3.17.0                apeglm_1.22.1              
[25] pheatmap_1.0.12             plyr_1.8.8                  reshape2_1.4.4              RColorBrewer_1.1-3         
[29] ggrepel_0.9.3               limma_3.56.2                WGCNA_1.72-1                fastcluster_1.2.3          
[33] dynamicTreeCut_1.63-1       ggnewscale_0.4.9            ggplot2_3.4.2               enrichplot_1.20.0          
[37] pathview_1.40.0             rrvgo_1.12.0                org.Hs.eg.db_3.17.0         AnnotationDbi_1.62.2       
[41] IRanges_2.34.1              S4Vectors_0.38.1            Biobase_2.60.0              BiocGenerics_0.46.0        
[45] clusterProfiler_4.8.2       enrichR_3.2                

loaded via a namespace (and not attached):
  [1] fs_1.6.3                bitops_1.0-7            HDO.db_0.99.1           httr_1.4.6              doParallel_1.0.17      
  [6] Rgraphviz_2.44.0        numDeriv_2016.8-1.1     profvis_0.3.8           tools_4.3.0             backports_1.4.1        
 [11] utf8_1.2.3              R6_2.5.1                lazyeval_0.2.2          urlchecker_1.0.1        withr_2.5.0            
 [16] prettyunits_1.1.1       gridExtra_2.3           preprocessCore_1.62.1   cli_3.6.1               scatterpie_0.2.1       
 [21] labeling_0.4.2          slam_0.1-50             KEGGgraph_1.60.0        mvtnorm_1.2-2           tm_0.7-11              
 [26] askpass_1.1             yulab.utils_0.0.6       gson_0.1.0              foreign_0.8-82          sessioninfo_1.2.2      
 [31] WriteXLS_6.4.0          bbmle_1.0.25            rstudioapi_0.15.0       impute_1.74.1           RSQLite_2.3.1          
 [36] treemap_2.4-4           generics_0.1.3          gridGraphics_0.5-1      Matrix_1.6-0            fansi_1.0.4            
 [41] lifecycle_1.0.3         yaml_2.3.7              qvalue_2.32.0           BiocFileCache_2.8.0     grid_4.3.0             
 [46] blob_1.2.4              promises_1.2.0.1        crayon_1.5.2            bdsmatrix_1.3-6         miniUI_0.1.1.1         
 [51] lattice_0.20-45         cowplot_1.1.1           KEGGREST_1.40.0         pillar_1.9.0            fgsea_1.26.0           
 [56] rjson_0.2.21            codetools_0.2-19        fastmatch_1.1-3         glue_1.6.2              downloader_0.4         
 [61] ggfun_0.1.1             remotes_2.4.2.1         data.table_1.14.8       vctrs_0.6.3             png_0.1-8              
 [66] treeio_1.24.2           gtable_0.3.3            emdbook_1.3.13          cachem_1.0.8            xfun_0.39              
 [71] S4Arrays_1.0.4          mime_0.12               tidygraph_1.2.3         coda_0.19-4             survival_3.4-0         
 [76] iterators_1.0.14        ellipsis_0.3.2          nlme_3.1-162            ggtree_3.8.0            bit64_4.0.5            
 [81] progress_1.2.2          filelock_1.0.2          rpart_4.1.19            colorspace_2.1-0        DBI_1.1.3              
 [86] Hmisc_5.1-0             nnet_7.3-18             processx_3.8.2          tidyselect_1.2.0        bit_4.0.5              
 [91] compiler_4.3.0          curl_5.0.1              graph_1.78.0            htmlTable_2.4.1         xml2_1.3.5             
 [96] NLP_0.2-1               DelayedArray_0.26.6     shadowtext_0.1.2        checkmate_2.2.0         scales_1.2.1           
[101] callr_3.7.3             rappdirs_0.3.3          digest_0.6.33           rmarkdown_2.23          XVector_0.40.0         
[106] htmltools_0.5.5         pkgconfig_2.0.3         base64enc_0.1-3         umap_0.2.10.0           dbplyr_2.3.3           
[111] fastmap_1.1.1           rlang_1.1.1.9000        htmlwidgets_1.6.2       shiny_1.7.4.1           farver_2.1.1           
[116] jsonlite_1.8.7          BiocParallel_1.34.2     RCurl_1.98-1.12         magrittr_2.0.3          Formula_1.2-5          
[121] GenomeInfoDbData_1.2.10 ggplotify_0.1.1         wordcloud_2.6           patchwork_1.1.2         munsell_0.5.0          
[126] Rcpp_1.0.11             ape_5.7-1               viridis_0.6.4           reticulate_1.30         stringi_1.7.12         
[131] ggraph_2.1.0            zlibbioc_1.46.0         MASS_7.3-58.2           pkgbuild_1.4.2          parallel_4.3.0         
[136] Biostrings_2.68.1       graphlayouts_1.0.0      splines_4.3.0           hms_1.1.3               locfit_1.5-9.8         
[141] ps_1.7.5                igraph_1.5.0.1          pkgload_1.3.2.1         XML_3.99-0.14           evaluate_0.21          
[146] BiocManager_1.30.21.1   tzdb_0.4.0              foreach_1.5.2           tweenr_2.0.2            httpuv_1.6.11          
[151] openssl_2.1.0           polyclip_1.10-4         gridBase_0.4-7          ggforce_0.4.1           xtable_1.8-4           
[156] RSpectra_0.16-1         tidytree_0.4.4          later_1.3.1             viridisLite_0.4.2       aplot_0.1.10           
[161] memoise_2.0.1           cluster_2.1.4           timechange_0.2.0

DESeq2 BatchEffect limma • 1.3k views

ADD COMMENT • link updated 8 months ago by Gordon Smyth 50k • written 8 months ago by Sara • 0

0

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 15 hours ago

United States

The line that comes right after the code in the DESeq2 vignette says

The design argument is necessary to avoiding removing variation associated with the treatment conditions. See ?removeBatchEffect in the limma package for details.

Which is not what you are trying to do! You already said you want to remove variation associated with the treatment conditions. It also says to look at the help page for removeBatchEffect, which would have told you that there are no such arguments as covariates1 or covariates2. You cannot use arbitrary arguments to a function! What the help page will tell you is that you can adjust for two batch effects (batch and batch2), and any number of numeric covariates if you put them in a matrix.

You have two factors (sex and Plate) that are 'batch effects', and multiple other things (sex, RIN, PC1-3) that are numeric covariates.

ADD COMMENT • link 8 months ago James W. MacDonald 65k

0

Entering edit mode

Thanks for your comment! I think I didn't write it really clearly before. I don't want to remove the variation associated with the treatment condition. But I want to remove the variation associated with Plate, RIN, Sex, Age, PC1, PC2,PC3.

ADD REPLY • link 8 months ago Sara • 0

score 2 · Accepted Answer · 2023-08-11

2

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 1 hour ago

WEHI, Melbourne, Australia

As James MacDonald has already noted, it would help you to consult the help page for removeBatchEffect() by typing ?removeBatchEffect. All limma functions have help pages that you can consult in this way.

In your case, you need

design <- model.matrix(~condition+Plate+RIN+Sex+Age+PC3+PC2+PC1)
design.condition <- design[,1:2]
design.confounders <- design[,-c(1:2)]
mat.corrected <- removeBatchEffect(mat,
   design=design.condition,
   covariates=design.confounders)

This code separates out the design matrix for condition from the design matrix columns for the confounders. The removeBatchEffect function then removes the effects of the confounders while preserving the effect of condition.

The code assumes that condition is a factor with two levels. If it has more than two levels, then change the column indices from 1:2 to 1:n where n is the number of levels.

ADD COMMENT • link 8 months ago Gordon Smyth 50k

0

Entering edit mode

Thank you for your comment and support. Just to make it clear for myself:

So, the "design=design.condition" is the one that I want to be preserved.

and the covariates=design.confounders is the one that I want to adjust for.

Sorry, am I getting right?

Now, when I am looking at design.condition <- design[,1:2] it has two columns Intercept and conditionpatient. the Intercept column, all are 1, and in the conditionpatient column, it assigned patients to 1 and controls to 0. Could you please explain what is this column?

Thanks! looking forward to hearing from you!

design.condition <- design[,1:2]


                    (Intercept) conditionpatient
patienet                1            1
patienet                1            1
patienet                1            1
patienet                1            1
control                 1            0
control                 1            0
control                 1            0
control                 1            0
control                 1            0

ADD REPLY • link 8 months ago Sara • 0

0

Entering edit mode

So, the "design=design.condition" is the one that I want to be preserved. and the covariates=design.confounders is the one that I want to adjust for.

Yes. Did I not already say that?

Could you please explain what is this column?

It represents the log-fold-change between patients and controls. Please look up any reference on linear models, for example:

ADD REPLY • link 8 months ago Gordon Smyth 50k