Hi,
I am trying to remove the batch effect of my confounders.
here is the DESeq2 mqtrix design that I used:
dds <- DESeqDataSetFromHTSeqCount(sampleTable=sampleTable,
directory=folder,
design=~Plate+RIN+Sex+Age+condition+PC3+PC2+PC1)
condition is the comparison of control individuals with patients.
dds <- estimateSizeFactors(dds)
keep <- rowSums( counts(dds) >= 10 ) >= 20 #filters out genes that have less than 20 samples with counts of 10
dds <- dds[keep,]
colData(dds)$condition <- relevel(colData(dds)$condition, ref = "Control")
dds<- DESeq(dds)
vsd <- vst(dds, blind = FALSE)
- I noticed variance normalized transformation in DESeq2 are not corrected for variables in my design formula.
- So, then I found I could use removeBatchEffect if I am right.
I want to correct my counts for Sex, Age, RIN, PC1, PC2, and PC3 (I want to remove the effect of them). My purpose is to correct my normalized counts for these confounders to use as input for WGCNA.
- I used the following lines from the DESeq2 manual:
mat <- assay(vsd)
mm <- model.matrix(~condition, colData(vsd))
mat <- limma::removeBatchEffect(mat, batch=vsd$batch, design=mm)
assay(vsd) <- mat
plotPCA(vsd)
- Did I define the mm right? (The condition is my control and disease group which I want to avoid removing variation associated with it. Plate, RIN, Sex, Age, PC1, PC2,PC3 are the ones that I want to correct my counts for them)
. What do I have to define in mm <- model.matrix()? the ones that I want to correct for them or the ones that I want to avoid removing variation associated with it (my condition control vs patients)?
- Also, here in mat how can I define the batch=vsd$
mat <- limma::removeBatchEffect(mat, batch=vsd$batch, design=mm)
- Should I write it like this:
mat <- limma::removeBatchEffect(mat, batch=vsd$Age, batch2=vsd$Sex, covariates1= vsd$PC1, covariates2=vsd$PC2 , covariates3=.., covariates4=..., design=mm)
# batch argument should be a categorical factor (like sex) and the covariates argument is for numerical covariates (like age).
- But, still in the case if I used the limma::removeBatchEffect in the right way, the counts that I am getting from mat (assay(vsd) <- mat) are really similar to the assay(vsd) counts before using the limma::removeBatchEffect
I would appreciate to help me please understand how to use it and how to correct my data for the confounders that I have.
Thank you in advance!
sessionInfo( )
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.1 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/Brussels
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] GOSemSim_2.26.1 DOSE_3.26.1 devtools_2.4.5 usethis_2.2.2
[5] knitr_1.43 lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0
[9] dplyr_1.1.2 purrr_1.0.1 readr_2.1.4 tidyr_1.3.0
[13] tibble_3.2.1 tidyverse_2.0.0 EnhancedVolcano_1.18.0 biomaRt_2.56.1
[17] DESeq2_1.40.2 SummarizedExperiment_1.30.2 MatrixGenerics_1.12.2 matrixStats_1.0.0
[21] GenomicRanges_1.52.0 GenomeInfoDb_1.36.1 GO.db_3.17.0 apeglm_1.22.1
[25] pheatmap_1.0.12 plyr_1.8.8 reshape2_1.4.4 RColorBrewer_1.1-3
[29] ggrepel_0.9.3 limma_3.56.2 WGCNA_1.72-1 fastcluster_1.2.3
[33] dynamicTreeCut_1.63-1 ggnewscale_0.4.9 ggplot2_3.4.2 enrichplot_1.20.0
[37] pathview_1.40.0 rrvgo_1.12.0 org.Hs.eg.db_3.17.0 AnnotationDbi_1.62.2
[41] IRanges_2.34.1 S4Vectors_0.38.1 Biobase_2.60.0 BiocGenerics_0.46.0
[45] clusterProfiler_4.8.2 enrichR_3.2
loaded via a namespace (and not attached):
[1] fs_1.6.3 bitops_1.0-7 HDO.db_0.99.1 httr_1.4.6 doParallel_1.0.17
[6] Rgraphviz_2.44.0 numDeriv_2016.8-1.1 profvis_0.3.8 tools_4.3.0 backports_1.4.1
[11] utf8_1.2.3 R6_2.5.1 lazyeval_0.2.2 urlchecker_1.0.1 withr_2.5.0
[16] prettyunits_1.1.1 gridExtra_2.3 preprocessCore_1.62.1 cli_3.6.1 scatterpie_0.2.1
[21] labeling_0.4.2 slam_0.1-50 KEGGgraph_1.60.0 mvtnorm_1.2-2 tm_0.7-11
[26] askpass_1.1 yulab.utils_0.0.6 gson_0.1.0 foreign_0.8-82 sessioninfo_1.2.2
[31] WriteXLS_6.4.0 bbmle_1.0.25 rstudioapi_0.15.0 impute_1.74.1 RSQLite_2.3.1
[36] treemap_2.4-4 generics_0.1.3 gridGraphics_0.5-1 Matrix_1.6-0 fansi_1.0.4
[41] lifecycle_1.0.3 yaml_2.3.7 qvalue_2.32.0 BiocFileCache_2.8.0 grid_4.3.0
[46] blob_1.2.4 promises_1.2.0.1 crayon_1.5.2 bdsmatrix_1.3-6 miniUI_0.1.1.1
[51] lattice_0.20-45 cowplot_1.1.1 KEGGREST_1.40.0 pillar_1.9.0 fgsea_1.26.0
[56] rjson_0.2.21 codetools_0.2-19 fastmatch_1.1-3 glue_1.6.2 downloader_0.4
[61] ggfun_0.1.1 remotes_2.4.2.1 data.table_1.14.8 vctrs_0.6.3 png_0.1-8
[66] treeio_1.24.2 gtable_0.3.3 emdbook_1.3.13 cachem_1.0.8 xfun_0.39
[71] S4Arrays_1.0.4 mime_0.12 tidygraph_1.2.3 coda_0.19-4 survival_3.4-0
[76] iterators_1.0.14 ellipsis_0.3.2 nlme_3.1-162 ggtree_3.8.0 bit64_4.0.5
[81] progress_1.2.2 filelock_1.0.2 rpart_4.1.19 colorspace_2.1-0 DBI_1.1.3
[86] Hmisc_5.1-0 nnet_7.3-18 processx_3.8.2 tidyselect_1.2.0 bit_4.0.5
[91] compiler_4.3.0 curl_5.0.1 graph_1.78.0 htmlTable_2.4.1 xml2_1.3.5
[96] NLP_0.2-1 DelayedArray_0.26.6 shadowtext_0.1.2 checkmate_2.2.0 scales_1.2.1
[101] callr_3.7.3 rappdirs_0.3.3 digest_0.6.33 rmarkdown_2.23 XVector_0.40.0
[106] htmltools_0.5.5 pkgconfig_2.0.3 base64enc_0.1-3 umap_0.2.10.0 dbplyr_2.3.3
[111] fastmap_1.1.1 rlang_1.1.1.9000 htmlwidgets_1.6.2 shiny_1.7.4.1 farver_2.1.1
[116] jsonlite_1.8.7 BiocParallel_1.34.2 RCurl_1.98-1.12 magrittr_2.0.3 Formula_1.2-5
[121] GenomeInfoDbData_1.2.10 ggplotify_0.1.1 wordcloud_2.6 patchwork_1.1.2 munsell_0.5.0
[126] Rcpp_1.0.11 ape_5.7-1 viridis_0.6.4 reticulate_1.30 stringi_1.7.12
[131] ggraph_2.1.0 zlibbioc_1.46.0 MASS_7.3-58.2 pkgbuild_1.4.2 parallel_4.3.0
[136] Biostrings_2.68.1 graphlayouts_1.0.0 splines_4.3.0 hms_1.1.3 locfit_1.5-9.8
[141] ps_1.7.5 igraph_1.5.0.1 pkgload_1.3.2.1 XML_3.99-0.14 evaluate_0.21
[146] BiocManager_1.30.21.1 tzdb_0.4.0 foreach_1.5.2 tweenr_2.0.2 httpuv_1.6.11
[151] openssl_2.1.0 polyclip_1.10-4 gridBase_0.4-7 ggforce_0.4.1 xtable_1.8-4
[156] RSpectra_0.16-1 tidytree_0.4.4 later_1.3.1 viridisLite_0.4.2 aplot_0.1.10
[161] memoise_2.0.1 cluster_2.1.4 timechange_0.2.0
Thank you for your comment and support. Just to make it clear for myself:
So, the "design=design.condition" is the one that I want to be preserved.
and the covariates=design.confounders is the one that I want to adjust for.
Sorry, am I getting right?
Now, when I am looking at design.condition <- design[,1:2] it has two columns Intercept and conditionpatient. the Intercept column, all are 1, and in the conditionpatient column, it assigned patients to 1 and controls to 0. Could you please explain what is this column?
Thanks! looking forward to hearing from you!
Yes. Did I not already say that?
It represents the log-fold-change between patients and controls. Please look up any reference on linear models, for example: