Question

help checking the design matrix and contrast in DESeq2 "not full rank" scenario

0

Entering edit mode

ccshao ▴ 70

@shao-chunxuan-6243

Last seen 16 months ago

Germany

Hi Community, I would like to get feedback about model designs and contrast dealing the issue of "Model not full rank" in DESeq2.

The sample meta info:

library(magrittr)
s_meta <- data.frame(grp = paste(rep(c("D", "L", "M", "H"), times = 2), rep(c("6h", "24h"), each = 4), sep = "_") %>%
                     rep(times = 12) %>%
                     factor(levels = c("D_6h", "L_6h", "M_6h", "H_6h", "D_24h", "L_24h", "M_24h", "H_24h")),
                     b = rep(paste0("b", 1:4), times = c(16, 24, 32, 24)) %>% factor,
                     id = paste0("U", rep(1:12, each = 8)))

Patients info:

s_meta
     grp  b  id
1   D_6h b1  U1
2   L_6h b1  U1
3   M_6h b1  U1
4   H_6h b1  U1
5  D_24h b1  U1
6  L_24h b1  U1
7  M_24h b1  U1
8  H_24h b1  U1
9   D_6h b1  U2
10  L_6h b1  U2
11  M_6h b1  U2
12  H_6h b1  U2
13 D_24h b1  U2
14 L_24h b1  U2
15 M_24h b1  U2
16 H_24h b1  U2
...
81  D_6h b4 U11
82  L_6h b4 U11
83  M_6h b4 U11
84  H_6h b4 U11
85 D_24h b4 U11
86 L_24h b4 U11
87 M_24h b4 U11
88 H_24h b4 U11
89  D_6h b4 U12
90  L_6h b4 U12
91  M_6h b4 U12
92  H_6h b4 U12
93 D_24h b4 U12
94 L_24h b4 U12
95 M_24h b4 U12
96 H_24h b4 U12

Basically, 12 patients enrolled in a dosage (L, M, H; D is the control ) dependent time-course (6h and 24h) experiment. We are interested in the difference between treatment and controls (eg. "L_6h" vs "D_6h"). I have used a simple design, design(dds) <- formula(~ id + gr_2), where id is the patients, and gr_2 represents condtions ("D_6h", "L_6h", etc)

However, later we found there is batch effect (indicated in column b in the s_meta, 4 subgroups). Simply change the design to design(dds) <- formula(~ id + b + gr_2) caused "Matrix not full rank" error.

Following the section of "model-matrix-not-full-rank" of link,

we get the new design:

#- Set the id number
s_meta$idn <- rep(c(1:2, 1:3, 1:4, 1:3), each = 8) %>% factor

s_meta
     grp  b  id idn
1   D_6h b1  U1   1
2   L_6h b1  U1   1
3   M_6h b1  U1   1
4   H_6h b1  U1   1
5  D_24h b1  U1   1
6  L_24h b1  U1   1
7  M_24h b1  U1   1
8  H_24h b1  U1   1
9   D_6h b1  U2   2
10  L_6h b1  U2   2
11  M_6h b1  U2   2
12  H_6h b1  U2   2
13 D_24h b1  U2   2
14 L_24h b1  U2   2
15 M_24h b1  U2   2
16 H_24h b1  U2   2
...
81  D_6h b4 U11   2
82  L_6h b4 U11   2
83  M_6h b4 U11   2
84  H_6h b4 U11   2
85 D_24h b4 U11   2
86 L_24h b4 U11   2
87 M_24h b4 U11   2
88 H_24h b4 U11   2
89  D_6h b4 U12   3
90  L_6h b4 U12   3
91  M_6h b4 U12   3
92  H_6h b4 U12   3
93 D_24h b4 U12   3
94 L_24h b4 U12   3
95 M_24h b4 U12   3
96 H_24h b4 U12   3

Modifications of designs.

#- assign to the colData
dds$b <- s_meta$b
dds$idn <- s_meta$idn
dds$grp <- s_meta$grp

#- rm zero columns as the data is unbalanced. Also no intercept.
m1     <- model.matrix(~ 0 + b + b:idn + b:grp, colData(dds))
a_zero <- apply(m1, 2, function(x) all(x == 0))
idx    <- which(a_zero)
m1     <- m1[, -idx]

#- !! new design. betaPrior is set to F.
design(dds) <- formula(~ 0 + b + b:idn + b:grp)
dds <- DESeq(dds, parallel = TRUE, full = m1, , betaPrior = FALSE)

And names

resultsNames(dds)

#  [1] "bb1"             "bb2"             "bb3"             "bb4"
#  [5] "bb1.idn2"        "bb2.idn2"        "bb3.idn2"        "bb4.idn2"
#  [9] "bb2.idn3"        "bb3.idn3"        "bb4.idn3"        "bb3.idn4"
# [13] "bb1.grpD_24h" "bb2.grpD_24h" "bb3.grpD_24h" "bb4.grpD_24h"
# [17] "bb1.grpL_6h"  "bb2.grpL_6h"  "bb3.grpL_6h"  "bb4.grpL_6h"
# [21] "bb1.grpL_24h" "bb2.grpL_24h" "bb3.grpL_24h" "bb4.grpL_24h"
# [25] "bb1.grpM_6h"  "bb2.grpM_6h"  "bb3.grpM_6h"  "bb4.grpM_6h"
# [29] "bb1.grpM_24h" "bb2.grpM_24h" "bb3.grpM_24h" "bb4.grpM_24h"
# [33] "bb1.grpH_6h"  "bb2.grpH_6h"  "bb3.grpH_6h"  "bb4.grpH_6h"
# [37] "bb1.grpH_24h" "bb2.grpH_24h" "bb3.grpH_24h" "bb4.grpH_24h"

We are interested in the difference between treatment and control. To get the significant test results of H_6h vs D_6h (The denominator groups are ignored):

#- H_6h vs. D_6h. listValues is important otherwise the fold changes do NOT reflect average effect.
#- In the examples of results with list, the listValues is NOT set as interaction effect is supposed to add to main.
xx_1 <- results(dds, list(c("bb1:grpH_6", "bb2:grpH_6", "bb3:grpH_6", "bb4:grpH_6")), listValues =  c(1/4, -1/4))

To get the significant test results of H_24h vs D_24h (the denominator groups are sepecified)

#- H_24h vs. D_24h
xx_2 <- results(dds, list(c("bb1.grpH_24h", "bb2.grpH_24h", "bb3.grpH_24h", "bb4.grpH_24h"),
                          c("bb1.grp_24h", "bb2.grpD_24h", "bb3.grpD_24h", "bb4.grpD_24h")), listValues =  c(1/4, -1/4))

That's first time I dealt wit the complex design and contrast, and I am not sure whether I have done it properly.

Please let me know if there are any issues, and general suggestions are greatly appreciated! Thanks a lot!

Session

sessionInfo( )
R version 4.3.0 (2023-04-21)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Berlin
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
 [1] BiocParallel_1.33.11        DESeq2_1.40.1
 [3] SummarizedExperiment_1.29.1 Biobase_2.59.0
 [5] MatrixGenerics_1.12.0       matrixStats_1.0.0
 [7] GenomicRanges_1.51.4        GenomeInfoDb_1.36.0
 [9] IRanges_2.33.1              S4Vectors_0.37.4
[11] BiocGenerics_0.46.0         future.apply_1.11.0
[13] future_1.32.0               data.table_1.14.10
[15] magrittr_2.0.3              patchwork_1.1.2
[17] ggplot2_3.4.2               ll_0.1.0
[19] colorout_1.2-1

loaded via a namespace (and not attached):
 [1] utf8_1.2.3              generics_0.1.3          bitops_1.0-7
 [4] lattice_0.21-8          listenv_0.9.0           digest_0.6.31
 [7] grid_4.3.0              iterators_1.0.14        foreach_1.5.2
[10] Matrix_1.5-4            fansi_1.0.4             scales_1.2.1
[13] codetools_0.2-19        RApiSerialize_0.1.2     cli_3.6.1
[16] crayon_1.5.2            rlang_1.1.1             XVector_0.39.0
[19] parallelly_1.36.0       munsell_0.5.0           withr_2.5.0
[22] DelayedArray_0.26.3     S4Arrays_1.0.1          qs_0.25.5
[25] tools_4.3.0             parallel_4.3.0          dplyr_1.1.2
[28] colorspace_2.1-0        locfit_1.5-9.7          GenomeInfoDbData_1.2.10
[31] globals_0.16.2          vctrs_0.6.2             R6_2.5.1
[34] lifecycle_1.0.3         zlibbioc_1.45.0         stringfish_0.15.8
[37] pkgconfig_2.0.3         RcppParallel_5.1.7      pillar_1.9.0
[40] gtable_0.3.3            Rcpp_1.0.10             glue_1.6.2
[43] xfun_0.39               tibble_3.2.1            tidyselect_1.2.0
[46] knitr_1.42              compiler_4.3.0          RCurl_1.98-1.12

DESeq2 • 753 views

ADD COMMENT • link 17 months ago • updated 16 months ago ccshao ▴ 70

score 0 · Answer 1 · 2024-02-17

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 24 days ago

United States

For guidance on statistical analysis plan, I recommend to work with a local statistician or someone familiar with linear models in R.

ADD COMMENT • link 17 months ago Michael Love 43k

0

Entering edit mode

Thanks for your ideas. I talked with one of my colleagues about the data and we think the model looks fine. A few updates have been made to my codes (see posts).

ADD REPLY • link 16 months ago ccshao ▴ 70