help checking the design matrix and contrast in DESeq2 "not full rank" scenario
1
0
Entering edit mode
ccshao ▴ 70
@shao-chunxuan-6243
Last seen 8 months ago
Germany

Hi Community, I would like to get feedback about model designs and contrast dealing the issue of "Model not full rank" in DESeq2.

The sample meta info:

library(magrittr)
s_meta <- data.frame(grp = paste(rep(c("D", "L", "M", "H"), times = 2), rep(c("6h", "24h"), each = 4), sep = "_") %>%
                     rep(times = 12) %>%
                     factor(levels = c("D_6h", "L_6h", "M_6h", "H_6h", "D_24h", "L_24h", "M_24h", "H_24h")),
                     b = rep(paste0("b", 1:4), times = c(16, 24, 32, 24)) %>% factor,
                     id = paste0("U", rep(1:12, each = 8)))

Patients info:

s_meta
     grp  b  id
1   D_6h b1  U1
2   L_6h b1  U1
3   M_6h b1  U1
4   H_6h b1  U1
5  D_24h b1  U1
6  L_24h b1  U1
7  M_24h b1  U1
8  H_24h b1  U1
9   D_6h b1  U2
10  L_6h b1  U2
11  M_6h b1  U2
12  H_6h b1  U2
13 D_24h b1  U2
14 L_24h b1  U2
15 M_24h b1  U2
16 H_24h b1  U2
...
81  D_6h b4 U11
82  L_6h b4 U11
83  M_6h b4 U11
84  H_6h b4 U11
85 D_24h b4 U11
86 L_24h b4 U11
87 M_24h b4 U11
88 H_24h b4 U11
89  D_6h b4 U12
90  L_6h b4 U12
91  M_6h b4 U12
92  H_6h b4 U12
93 D_24h b4 U12
94 L_24h b4 U12
95 M_24h b4 U12
96 H_24h b4 U12

Basically, 12 patients enrolled in a dosage (L, M, H; D is the control ) dependent time-course (6h and 24h) experiment. We are interested in the difference between treatment and controls (eg. "L_6h" vs "D_6h"). I have used a simple design, design(dds) <- formula(~ id + gr_2), where id is the patients, and gr_2 represents condtions ("D_6h", "L_6h", etc)

However, later we found there is batch effect (indicated in column b in the s_meta, 4 subgroups). Simply change the design to design(dds) <- formula(~ id + b + gr_2) caused "Matrix not full rank" error.

Following the section of "model-matrix-not-full-rank" of link,

we get the new design:

#- Set the id number
s_meta$idn <- rep(c(1:2, 1:3, 1:4, 1:3), each = 8) %>% factor

s_meta
     grp  b  id idn
1   D_6h b1  U1   1
2   L_6h b1  U1   1
3   M_6h b1  U1   1
4   H_6h b1  U1   1
5  D_24h b1  U1   1
6  L_24h b1  U1   1
7  M_24h b1  U1   1
8  H_24h b1  U1   1
9   D_6h b1  U2   2
10  L_6h b1  U2   2
11  M_6h b1  U2   2
12  H_6h b1  U2   2
13 D_24h b1  U2   2
14 L_24h b1  U2   2
15 M_24h b1  U2   2
16 H_24h b1  U2   2
...
81  D_6h b4 U11   2
82  L_6h b4 U11   2
83  M_6h b4 U11   2
84  H_6h b4 U11   2
85 D_24h b4 U11   2
86 L_24h b4 U11   2
87 M_24h b4 U11   2
88 H_24h b4 U11   2
89  D_6h b4 U12   3
90  L_6h b4 U12   3
91  M_6h b4 U12   3
92  H_6h b4 U12   3
93 D_24h b4 U12   3
94 L_24h b4 U12   3
95 M_24h b4 U12   3
96 H_24h b4 U12   3

Modifications of designs.

#- assign to the colData
dds$b <- s_meta$b
dds$idn <- s_meta$idn
dds$grp <- s_meta$grp

#- rm zero columns as the data is unbalanced. Also no intercept.
m1     <- model.matrix(~ 0 + b + b:idn + b:grp, colData(dds))
a_zero <- apply(m1, 2, function(x) all(x == 0))
idx    <- which(a_zero)
m1     <- m1[, -idx]

#- !! new design. betaPrior is set to F.
design(dds) <- formula(~ 0 + b + b:idn + b:grp)
dds <- DESeq(dds, parallel = TRUE, full = m1, , betaPrior = FALSE)

And names

resultsNames(dds)

#  [1] "bb1"             "bb2"             "bb3"             "bb4"
#  [5] "bb1.idn2"        "bb2.idn2"        "bb3.idn2"        "bb4.idn2"
#  [9] "bb2.idn3"        "bb3.idn3"        "bb4.idn3"        "bb3.idn4"
# [13] "bb1.grpD_24h" "bb2.grpD_24h" "bb3.grpD_24h" "bb4.grpD_24h"
# [17] "bb1.grpL_6h"  "bb2.grpL_6h"  "bb3.grpL_6h"  "bb4.grpL_6h"
# [21] "bb1.grpL_24h" "bb2.grpL_24h" "bb3.grpL_24h" "bb4.grpL_24h"
# [25] "bb1.grpM_6h"  "bb2.grpM_6h"  "bb3.grpM_6h"  "bb4.grpM_6h"
# [29] "bb1.grpM_24h" "bb2.grpM_24h" "bb3.grpM_24h" "bb4.grpM_24h"
# [33] "bb1.grpH_6h"  "bb2.grpH_6h"  "bb3.grpH_6h"  "bb4.grpH_6h"
# [37] "bb1.grpH_24h" "bb2.grpH_24h" "bb3.grpH_24h" "bb4.grpH_24h"

We are interested in the difference between treatment and control. To get the significant test results of H_6h vs D_6h (The denominator groups are ignored):

#- H_6h vs. D_6h. listValues is important otherwise the fold changes do NOT reflect average effect.
#- In the examples of results with list, the listValues is NOT set as interaction effect is supposed to add to main.
xx_1 <- results(dds, list(c("bb1:grpH_6", "bb2:grpH_6", "bb3:grpH_6", "bb4:grpH_6")), listValues =  c(1/4, -1/4))

To get the significant test results of H_24h vs D_24h (the denominator groups are sepecified)

#- H_24h vs. D_24h
xx_2 <- results(dds, list(c("bb1.grpH_24h", "bb2.grpH_24h", "bb3.grpH_24h", "bb4.grpH_24h"),
                          c("bb1.grp_24h", "bb2.grpD_24h", "bb3.grpD_24h", "bb4.grpD_24h")), listValues =  c(1/4, -1/4))

That's first time I dealt wit the complex design and contrast, and I am not sure whether I have done it properly.

Please let me know if there are any issues, and general suggestions are greatly appreciated! Thanks a lot!

Session

sessionInfo( )
R version 4.3.0 (2023-04-21)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Berlin
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
 [1] BiocParallel_1.33.11        DESeq2_1.40.1
 [3] SummarizedExperiment_1.29.1 Biobase_2.59.0
 [5] MatrixGenerics_1.12.0       matrixStats_1.0.0
 [7] GenomicRanges_1.51.4        GenomeInfoDb_1.36.0
 [9] IRanges_2.33.1              S4Vectors_0.37.4
[11] BiocGenerics_0.46.0         future.apply_1.11.0
[13] future_1.32.0               data.table_1.14.10
[15] magrittr_2.0.3              patchwork_1.1.2
[17] ggplot2_3.4.2               ll_0.1.0
[19] colorout_1.2-1

loaded via a namespace (and not attached):
 [1] utf8_1.2.3              generics_0.1.3          bitops_1.0-7
 [4] lattice_0.21-8          listenv_0.9.0           digest_0.6.31
 [7] grid_4.3.0              iterators_1.0.14        foreach_1.5.2
[10] Matrix_1.5-4            fansi_1.0.4             scales_1.2.1
[13] codetools_0.2-19        RApiSerialize_0.1.2     cli_3.6.1
[16] crayon_1.5.2            rlang_1.1.1             XVector_0.39.0
[19] parallelly_1.36.0       munsell_0.5.0           withr_2.5.0
[22] DelayedArray_0.26.3     S4Arrays_1.0.1          qs_0.25.5
[25] tools_4.3.0             parallel_4.3.0          dplyr_1.1.2
[28] colorspace_2.1-0        locfit_1.5-9.7          GenomeInfoDbData_1.2.10
[31] globals_0.16.2          vctrs_0.6.2             R6_2.5.1
[34] lifecycle_1.0.3         zlibbioc_1.45.0         stringfish_0.15.8
[37] pkgconfig_2.0.3         RcppParallel_5.1.7      pillar_1.9.0
[40] gtable_0.3.3            Rcpp_1.0.10             glue_1.6.2
[43] xfun_0.39               tibble_3.2.1            tidyselect_1.2.0
[46] knitr_1.42              compiler_4.3.0          RCurl_1.98-1.12
DESeq2 • 619 views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 6 days ago
United States

For guidance on statistical analysis plan, I recommend to work with a local statistician or someone familiar with linear models in R.

ADD COMMENT
0
Entering edit mode

Thanks for your ideas. I talked with one of my colleagues about the data and we think the model looks fine. A few updates have been made to my codes (see posts).

ADD REPLY

Login before adding your answer.

Traffic: 534 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6