Hello,
I have a question about the complex study design in the R ALDEx2 package (https://bioconductor.org/packages/devel/bioc/vignettes/ALDEx2/inst/doc/ALDEx2_vignette.html#46_Complex_study_designs_and_the_aldexglm_module). My goal is to compare more than two groups (e.g. two or more treatments vs control). In a simple case with two groups, we use a vector with two groups:
data(selex)
#subset only the last 400 features for efficiency
selex.sub <- selex[1:400,]
conds <- c(rep("NS", 7), rep("S", 7))
Where seven samples are from the "NS" (non-selective) and the other seven are from the "S" (selective) condition. This is clear. But I don't understand why we don't do the same in the complex case with a model matrix:
data(selex)
selex.sub <- selex[1:500, ]
covariates <- data.frame("A" = sample(0:1, 14, replace = TRUE),
"B" = c(rep(0, 7), rep(1, 7)),
"Z" = sample(c(1,2,3), 14, replace=TRUE))
mm <- model.matrix(~ A + Z + B, covariates)
In this case, the model matrix will be the following (I set the random seed to 1):
(Intercept) A Z B
1 1 0 1 0
2 1 1 2 0
3 1 0 2 0
4 1 0 2 0
5 1 1 2 0
6 1 0 3 0
7 1 0 1 0
8 1 0 3 1
9 1 1 1 1
10 1 1 1 1
11 1 0 1 1
12 1 0 1 1
13 1 0 2 1
14 1 0 1 1
attr(,"assign")
[1] 0 1 2 3
Why do we use the sample() command? Why do we allow Z to take values 2 or 3? Shouldn't we have a matrix where values don't overlap, like:
(Intercept) A B Z
1 1 1 0 0
2 1 1 0 0
3 1 1 0 0
4 1 1 0 0
5 1 0 1 0
6 1 0 1 0
7 1 0 1 0
8 1 0 1 0
9 1 0 0 1
10 1 0 0 1
11 1 0 0 1
12 1 0 0 1
13 1 0 0 1
14 1 0 0 1
attr(,"assign")
[1] 0 1 2 3
I noticed that if I follow the tutorial, the output of aldex.glm.effect(x) is a list of two elements (A and B). Whereas if I use my matrix of non-overlapping values, aldex.glm.effect(x) gives a list of 3 (A, B, Z). What could be the reason? And what do we use here as a reference group (I mean, what is considered a control group)? It might sound like a simple question, but I'm a beginner at ALDEx2. I would be grateful for any help or insight.
If it helps, here is the information from sessionInfo():
> sessionInfo()
R version 4.3.0 (2023-04-21 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: Asia/Tokyo
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ALDEx2_1.32.0 zCompositions_1.4.0-1 truncnorm_1.0-9 NADA_1.6-2
[5] survival_3.5-5 MASS_7.3-60
loaded via a namespace (and not attached):
[1] utf8_1.2.3 generics_0.1.3 bitops_1.0-7
[4] xml2_1.3.4 lattice_0.21-8 magrittr_2.0.3
[7] grid_4.3.0 Matrix_1.5-5 GenomeInfoDb_1.36.0
[10] ggtext_0.1.2 fansi_1.0.4 scales_1.2.1
[13] codetools_0.2-19 cli_3.6.1 crayon_1.5.2
[16] rlang_1.1.1 XVector_0.40.0 Biobase_2.60.0
[19] munsell_0.5.0 splines_4.3.0 DelayedArray_0.26.3
[22] S4Arrays_1.0.4 tools_4.3.0 parallel_4.3.0
[25] BiocParallel_1.34.2 dplyr_1.1.2 colorspace_2.1-0
[28] ggplot2_3.4.2 GenomeInfoDbData_1.2.10 SummarizedExperiment_1.30.2
[31] RcppZiggurat_0.1.6 Rfast_2.0.7 BiocGenerics_0.46.0
[34] vctrs_0.6.2 R6_2.5.1 matrixStats_0.63.0
[37] stats4_4.3.0 lifecycle_1.0.3 zlibbioc_1.46.0
[40] S4Vectors_0.38.1 IRanges_2.34.0 pkgconfig_2.0.3
[43] pillar_1.9.0 gtable_0.3.3 glue_1.6.2
[46] Rcpp_1.0.10 tibble_3.2.1 GenomicRanges_1.52.0
[49] tidyselect_1.2.0 rstudioapi_0.14 MatrixGenerics_1.12.2
[52] compiler_4.3.0 RCurl_1.98-1.12 gridtext_0.1.5