Hello! I'm using edgeR on DMR mode. My data are: five different conditions (without a clear reference) in two different replicates (condA_1, condA_2, cond_B1, cond_B2...). there is a clear batch effect based on the replicate so i want to include a replicate column in the design matrix. I build the matrix like this:
model.matrix(~ 0 + groups + rep)
groupsGII groupsGV groupsR groupsS groupsT reptwo
GII_1 1 0 0 0 0 0
GII_2 1 0 0 0 0 1
GV_1 0 1 0 0 0 0
GV_2 0 1 0 0 0 1
# ...
That makes sense. When I use the modelMatrixMeth though the info for the "second replicate factor" gets lost for the <cond>_rep2_UN, and it's kept only for <cond>_rep2_MET. I get something like this:
Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 Sample9 Sample10 groupsGII groupsGV groupsR groupsS groupsT reptwo
GII_1_MET 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
GII_1_UNM 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
GV_2_MET 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1
GV_2_UNM 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 **0**
and I would expect smth like:
Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 Sample9 Sample10 groupsGII groupsGV groupsR groupsS groupsT reptwo
GII_1_MET 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
GII_1_UNM 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
GV_2_MET 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1
GV_2_UNM 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 **1**
Also, if I try to set it "manually", the estimation fails and I get this error:
> dge_f_d <- estimateDisp(dge_f, design = d_new_m, trend="none")
Error in glmFit.default(sely, design, offset = seloffset, dispersion = 0.05, :
Design matrix not of full rank. The following coefficients not estimable:
reptwo
Thanks a lot!!! Francesco
> sessionInfo()
R version 4.3.3 (2024-02-29)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: AlmaLinux 9.3 (Shamrock Pampas Cat)
Matrix products: default
BLAS/LAPACK: FlexiBLAS OPENBLAS-OPENMP; LAPACK version 3.9.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8 LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8 LC_PAPER=C.UTF-8
[8] LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
time zone: UTC
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] Biostrings_2.70.3 XVector_0.42.0 data.table_1.17.0 viridis_0.6.5 viridisLite_0.4.2 EnhancedVolcano_1.20.0
[7] ggrepel_0.9.6 lubridate_1.9.4 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4 purrr_1.0.4
[13] readr_2.1.5 tibble_3.2.1 tidyverse_2.0.0 tidyr_1.3.1 gridExtra_2.3 ggpmisc_0.6.1
[19] ggpp_0.5.8-1 pheatmap_1.0.12 factoextra_1.0.7 ggplot2_3.5.1 DESeq2_1.42.1 SummarizedExperiment_1.32.0
[25] MatrixGenerics_1.14.0 matrixStats_1.5.0 GenomicRanges_1.54.1 GenomeInfoDb_1.38.8 org.Hs.eg.db_3.18.0 AnnotationDbi_1.64.1
[31] IRanges_2.36.0 S4Vectors_0.40.2 Biobase_2.62.0 BiocGenerics_0.48.1 edgeR_4.0.16 limma_3.58.1
[37] NBPSeq_0.3.1 qvalue_2.34.0
loaded via a namespace (and not attached):
[1] DBI_1.2.3 bitops_1.0-9 polynom_1.4-1 rlang_1.1.5 magrittr_2.0.3 compiler_4.3.3 RSQLite_2.3.9
[8] png_0.1-8 vctrs_0.6.5 reshape2_1.4.4 quantreg_6.1 pkgconfig_2.0.3 crayon_1.5.3 fastmap_1.2.0
[15] backports_1.5.0 rmarkdown_2.29 tzdb_0.4.0 MatrixModels_0.5-3 bit_4.6.0 xfun_0.51 zlibbioc_1.48.2
[22] cachem_1.1.0 blob_1.2.4 DelayedArray_0.28.0 BiocParallel_1.36.0 broom_1.0.7 parallel_4.3.3 R6_2.6.1
[29] stringi_1.8.4 RColorBrewer_1.1-3 car_3.1-3 Rcpp_1.0.14 knitr_1.49 timechange_0.3.0 Matrix_1.6-5
[36] splines_4.3.3 tidyselect_1.2.1 yaml_2.3.10 rstudioapi_0.17.1 abind_1.4-8 codetools_0.2-19 lattice_0.22-5
[43] plyr_1.8.9 withr_3.0.2 KEGGREST_1.42.0 evaluate_1.0.3 survival_3.5-8 pillar_1.10.1 ggpubr_0.6.0
[50] carData_3.0-5 generics_0.1.3 RCurl_1.98-1.16 hms_1.1.3 munsell_0.5.1 scales_1.3.0 glue_1.8.0
[57] tools_4.3.3 SparseM_1.84-2 locfit_1.5-9.12 ggsignif_0.6.4 grid_4.3.3 colorspace_2.1-1 GenomeInfoDbData_1.2.11
[64] Formula_1.2-5 cli_3.6.4 S4Arrays_1.2.1 gtable_0.3.6 rstatix_0.7.2 digest_0.6.37 SparseArray_1.2.4
[71] memoise_2.0.1 htmltools_0.5.8.1 lifecycle_1.0.4 httr_1.4.7 statmod_1.5.0 bit64_4.6.0-1 MASS_7.3-60.0.1
Ah, this is a confusing type of design matrix. It's hard to create using R's formula syntax. Also a very common pattern, so worth getting to know!
Section 7.5 in this vignette is essentially applying the same trick.
If you imagine your samples as a factor, the samples are nested within rep. By accounting for the effect of each sample, you have also automatically accounted for the effect of rep. So the linear model already accounts for GV_2_UNM being in reptwo with the Sample2 coefficient. However we might further suppose there is a difference between UNM and MET that is only seen in the reptwo samples, and which applies to all of the reptwo samples, which is what the reptwo coefficient accounts for.
You should see the same pattern with your group factor.