Question

edgeR in DMR mode - replicates give batch effect - how to build design matrix with condition and replicate

0

Entering edit mode

Francesco • 0

@374098e5

Last seen 7 months ago

Spain

Hello! I'm using edgeR on DMR mode. My data are: five different conditions (without a clear reference) in two different replicates (condA_1, condA_2, cond_B1, cond_B2...). there is a clear batch effect based on the replicate so i want to include a replicate column in the design matrix. I build the matrix like this:

model.matrix(~ 0 + groups + rep)
       groupsGII groupsGV groupsR groupsS groupsT reptwo
GII_1        1        0       0       0       0      0
GII_2        1        0       0       0       0      1
GV_1         0        1       0       0       0      0
GV_2         0        1       0       0       0      1
# ...

That makes sense. When I use the modelMatrixMeth though the info for the "second replicate factor" gets lost for the <cond>_rep2_UN, and it's kept only for <cond>_rep2_MET. I get something like this:

             Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 Sample9 Sample10 groupsGII groupsGV groupsR groupsS groupsT reptwo
GII_1_MET        1       0       0       0       0       0       0       0       0        0         1        0       0       0       0      0
GII_1_UNM        1       0       0       0       0       0       0       0       0        0         0        0       0       0       0      0
GV_2_MET         0       1       0       0       0       0       0       0       0        0         1        0       0       0       0      1
GV_2_UNM         0       1       0       0       0       0       0       0       0        0         0        0       0       0       0      **0**

and I would expect smth like:

             Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 Sample9 Sample10 groupsGII groupsGV groupsR groupsS groupsT reptwo
GII_1_MET        1       0       0       0       0       0       0       0       0        0         1        0       0       0       0      0
GII_1_UNM        1       0       0       0       0       0       0       0       0        0         0        0       0       0       0      0
GV_2_MET         0       1       0       0       0       0       0       0       0        0         1        0       0       0       0      1
GV_2_UNM         0       1       0       0       0       0       0       0       0        0         0        0       0       0       0      **1**

Also, if I try to set it "manually", the estimation fails and I get this error:

    > dge_f_d <- estimateDisp(dge_f, design = d_new_m, trend="none")
Error in glmFit.default(sely, design, offset = seloffset, dispersion = 0.05,  : 
  Design matrix not of full rank.  The following coefficients not estimable:
 reptwo

Thanks a lot!!! Francesco

> sessionInfo()
R version 4.3.3 (2024-02-29)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: AlmaLinux 9.3 (Shamrock Pampas Cat)

Matrix products: default
BLAS/LAPACK: FlexiBLAS OPENBLAS-OPENMP;  LAPACK version 3.9.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8        LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8    LC_PAPER=C.UTF-8      
 [8] LC_NAME=C              LC_ADDRESS=C           LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: UTC
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] Biostrings_2.70.3           XVector_0.42.0              data.table_1.17.0           viridis_0.6.5               viridisLite_0.4.2           EnhancedVolcano_1.20.0     
 [7] ggrepel_0.9.6               lubridate_1.9.4             forcats_1.0.0               stringr_1.5.1               dplyr_1.1.4                 purrr_1.0.4                
[13] readr_2.1.5                 tibble_3.2.1                tidyverse_2.0.0             tidyr_1.3.1                 gridExtra_2.3               ggpmisc_0.6.1              
[19] ggpp_0.5.8-1                pheatmap_1.0.12             factoextra_1.0.7            ggplot2_3.5.1               DESeq2_1.42.1               SummarizedExperiment_1.32.0
[25] MatrixGenerics_1.14.0       matrixStats_1.5.0           GenomicRanges_1.54.1        GenomeInfoDb_1.38.8         org.Hs.eg.db_3.18.0         AnnotationDbi_1.64.1       
[31] IRanges_2.36.0              S4Vectors_0.40.2            Biobase_2.62.0              BiocGenerics_0.48.1         edgeR_4.0.16                limma_3.58.1               
[37] NBPSeq_0.3.1                qvalue_2.34.0              

loaded via a namespace (and not attached):
 [1] DBI_1.2.3               bitops_1.0-9            polynom_1.4-1           rlang_1.1.5             magrittr_2.0.3          compiler_4.3.3          RSQLite_2.3.9          
 [8] png_0.1-8               vctrs_0.6.5             reshape2_1.4.4          quantreg_6.1            pkgconfig_2.0.3         crayon_1.5.3            fastmap_1.2.0          
[15] backports_1.5.0         rmarkdown_2.29          tzdb_0.4.0              MatrixModels_0.5-3      bit_4.6.0               xfun_0.51               zlibbioc_1.48.2        
[22] cachem_1.1.0            blob_1.2.4              DelayedArray_0.28.0     BiocParallel_1.36.0     broom_1.0.7             parallel_4.3.3          R6_2.6.1               
[29] stringi_1.8.4           RColorBrewer_1.1-3      car_3.1-3               Rcpp_1.0.14             knitr_1.49              timechange_0.3.0        Matrix_1.6-5           
[36] splines_4.3.3           tidyselect_1.2.1        yaml_2.3.10             rstudioapi_0.17.1       abind_1.4-8             codetools_0.2-19        lattice_0.22-5         
[43] plyr_1.8.9              withr_3.0.2             KEGGREST_1.42.0         evaluate_1.0.3          survival_3.5-8          pillar_1.10.1           ggpubr_0.6.0           
[50] carData_3.0-5           generics_0.1.3          RCurl_1.98-1.16         hms_1.1.3               munsell_0.5.1           scales_1.3.0            glue_1.8.0             
[57] tools_4.3.3             SparseM_1.84-2          locfit_1.5-9.12         ggsignif_0.6.4          grid_4.3.3              colorspace_2.1-1        GenomeInfoDbData_1.2.11
[64] Formula_1.2-5           cli_3.6.4               S4Arrays_1.2.1          gtable_0.3.6            rstatix_0.7.2           digest_0.6.37           SparseArray_1.2.4      
[71] memoise_2.0.1           htmltools_0.5.8.1       lifecycle_1.0.4         httr_1.4.7              statmod_1.5.0           bit64_4.6.0-1           MASS_7.3-60.0.1

edgeR • 892 views

ADD COMMENT • link updated 6 months ago by Paul Harrison ▴ 110 • written 7 months ago by Francesco • 0

0

Entering edit mode

Ah, this is a confusing type of design matrix. It's hard to create using R's formula syntax. Also a very common pattern, so worth getting to know!

Section 7.5 in this vignette is essentially applying the same trick.

If you imagine your samples as a factor, the samples are nested within rep. By accounting for the effect of each sample, you have also automatically accounted for the effect of rep. So the linear model already accounts for GV_2_UNM being in reptwo with the Sample2 coefficient. However we might further suppose there is a difference between UNM and MET that is only seen in the reptwo samples, and which applies to all of the reptwo samples, which is what the reptwo coefficient accounts for.

You should see the same pattern with your group factor.

ADD REPLY • link 6 months ago Paul Harrison ▴ 110

score 0 · Answer 1 · 2025-05-21

0

Entering edit mode

Gordon Smyth 53k

@gordon-smyth

Last seen 2 hours ago

WEHI, Melbourne, Australia

modelMatrixMeth() works correctly. The second replicate factor has not been lost.

Please do not make ad hoc manual modifications to the design matrix. Please use the correct design matrix as provided by the software.

ADD COMMENT • link 7 months ago Gordon Smyth 53k