Hello!
I am hoping to check that A) the complex model I've had to fit to this dataset is correct and that B) my post-hoc double checking of results significance is reasonable.
I'm working on an analysis where we're interested in understanding the differences between two disease subtypes (GroupA and GroupB) at two time points (Visit1 and Visit2) while controlling for the individual. We have an unbalanced number of observations between groups (nGroupA = 6, nGroupB = 8). The corresponding coldata matrix looks like this:
suppressMessages(library(tidyverse))
suppressMessages(library(DESeq2))
set.seed(8675309)
samples <- tibble(
ID = rep(seq(1,14), each = 2),
Group = c(rep("GroupA", 12), rep("GroupB", 16)),
Visit = c(rep(c("Visit1", "Visit2"), 14))
) %>% data.frame
samples
ID Group Visit
1 1 GroupA Visit1
2 1 GroupA Visit2
3 2 GroupA Visit1
4 2 GroupA Visit2
5 3 GroupA Visit1
6 3 GroupA Visit2
7 4 GroupA Visit1
8 4 GroupA Visit2
9 5 GroupA Visit1
10 5 GroupA Visit2
11 6 GroupA Visit1
12 6 GroupA Visit2
13 7 GroupB Visit1
14 7 GroupB Visit2
15 8 GroupB Visit1
16 8 GroupB Visit2
17 9 GroupB Visit1
18 9 GroupB Visit2
19 10 GroupB Visit1
20 10 GroupB Visit2
21 11 GroupB Visit1
22 11 GroupB Visit2
23 12 GroupB Visit1
24 12 GroupB Visit2
25 13 GroupB Visit1
26 13 GroupB Visit2
27 14 GroupB Visit1
28 14 GroupB Visit2
A simple fit of this model that controls for individuals and interactions (~ ID + Group + Visit + Group:Visit
)
produces the "not full rank error" so I followed the docs to create a model where individuals further IDed by their
Group
membership in sequence:
# GroupA ID 1-6, GroupB ID 1-8, two visits each
samples$GroupID <- as.factor(c(rep(seq(1, 6), each = 2), rep(seq(1, 8), each = 2)))
# add main and interaction effects for desired conditions
customMatrix <- model.matrix(~Group + Visit + Group:GroupID + Group:Visit, samples) %>%
data.frame
As noted in the docs, this will produce some columns of pure 0s and create the "not full rank" error again anyway give there is no observation 7 or 8 in GroupA
, so we remove them.
# filter out the empty columns
customMatrix <- customMatrix %>%
data.frame %>%
dplyr::select(-c("GroupGroupA.GroupID7","GroupGroupA.GroupID8")) %>%
as.matrix
We now have a usable matrix.
customMatrix # too big for forum
We can now run DESeq2:
### optional: generate a fake counts df with fake gene names so this can be run top-bottom
counts <- matrix(nrow = 0, ncol = 28)
# make ten fake genes with count data
for(i in 1:10) {
# make them appear different so DESeq2 will fit
x <- rbinom(14, sample(1000:100000, 1), runif(1, 0, 1))
y <- rbinom(14, sample(1000:100, 1), runif(1, 0, 1))
# create a row
counts <- rbind(counts, c(x,y))
# fake a gene name
rownames(counts)[i] <- paste(sample(LETTERS, 10, TRUE), collapse = "")
}
counts <- data.frame(counts)
colnames(counts) <- paste0("Subject", samples$ID)
# run DESeq2
dds <- DESeq(
DESeqDataSetFromMatrix(
countData = counts,
colData = samples,
design = customMatrix
)
)
Because we're interested in genes that are significant after controlling for all these factors I assumed we should submit a contrast vector that's full of 1s to get our hits from results()
:
# consider whole model model for significance testing
res <- results(dds, contrast = rep(1, length(resultsNames(dds))))
# or:
# res <- results(dds, contrast = list(resultsNames(dds)))
# get significant hits
sigRes <- res %>%
data.frame %>%
filter(padj < 0.05)
I inspected the plots to sanity check my results. Some of them looked somewhat unconvincing to me by eye, so I applied some post-hoc t-tests to ensure I was getting significant hits. I extracted the counts using plotCounts()
, and t.tested in terms of Group
or Visit
(I did both out of curiosity).
### for all those things DESeq2 is calling significant, do a t-test and a sanity plot
# look up each significant gene in the res table so we can call plotCounts on it
for(i in match(rownames(sigRes), rownames(res))) {
# gene name
gene <- rownames(res)[i]
# get the DESeq2 normalized counts
current <- plotCounts(dds, gene = i, intgroup = c("Group", "Visit"), returnData = TRUE)
# do a t-test for Group or Visit
tTest <- t.test(count ~ Group, data = current)
# save or plot the ones that pass
if(tTest$p.value < 0.05) {
### save a list of hits that passed significance via DESeq2 and post-hoc test
print("this gene passed round two")
} else {
### pass on this gene to be safe
print("this gene did not pass round two")
}
}
My questions are :
1) Is this the correct way to extract results for a complex, user-defined matrix if we want to control for all the conditions mentioned? 2) Is extra filtering by t-test for features of interest in this manner appropriate?
Thank you!
sessionInfo( )
R version 3.6.2 (2019-12-12)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.7
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] DESeq2_1.26.0 SummarizedExperiment_1.16.1 DelayedArray_0.12.3 BiocParallel_1.20.1
[5] matrixStats_0.57.0 Biobase_2.46.0 GenomicRanges_1.38.0 GenomeInfoDb_1.22.1
[9] IRanges_2.20.2 S4Vectors_0.24.4 BiocGenerics_0.32.0 forcats_0.5.0
[13] stringr_1.4.0 dplyr_1.0.2 purrr_0.3.4 readr_1.4.0
[17] tidyr_1.1.2 tibble_3.0.4 ggplot2_3.3.2 tidyverse_1.3.0
loaded via a namespace (and not attached):
[1] bitops_1.0-6 fs_1.5.0 bit64_4.0.5 lubridate_1.7.9 RColorBrewer_1.1-2
[6] httr_1.4.2 tools_3.6.2 backports_1.1.10 R6_2.4.1 rpart_4.1-15
[11] Hmisc_4.4-1 DBI_1.1.0 colorspace_1.4-1 nnet_7.3-14 withr_2.3.0
[16] tidyselect_1.1.0 gridExtra_2.3 bit_4.0.4 compiler_3.6.2 cli_2.1.0
[21] rvest_0.3.6 htmlTable_2.1.0 xml2_1.3.2 scales_1.1.1 checkmate_2.0.0
[26] genefilter_1.68.0 digest_0.6.27 foreign_0.8-76 XVector_0.26.0 base64enc_0.1-3
[31] jpeg_0.1-8.1 pkgconfig_2.0.3 htmltools_0.5.0 dbplyr_1.4.4 htmlwidgets_1.5.2
[36] rlang_0.4.8 readxl_1.3.1 RSQLite_2.2.1 rstudioapi_0.11 generics_0.0.2
[41] jsonlite_1.7.1 RCurl_1.98-1.2 magrittr_1.5 GenomeInfoDbData_1.2.2 Formula_1.2-3
[46] Matrix_1.2-18 Rcpp_1.0.5 munsell_0.5.0 fansi_0.4.1 lifecycle_0.2.0
[51] stringi_1.5.3 zlibbioc_1.32.0 grid_3.6.2 blob_1.2.1 crayon_1.3.4
[56] lattice_0.20-41 haven_2.3.1 splines_3.6.2 annotate_1.64.0 hms_0.5.3
[61] locfit_1.5-9.4 knitr_1.30 pillar_1.4.6 geneplotter_1.64.0 XML_3.99-0.3
[66] reprex_0.3.0 glue_1.4.2 latticeExtra_0.6-29 data.table_1.12.8 modelr_0.1.8
[71] png_0.1-7 vctrs_0.3.4 cellranger_1.1.0 gtable_0.3.0 assertthat_0.2.1
[76] xfun_0.18 xtable_1.8-4 broom_0.7.1 survival_3.2-7 memoise_1.1.0
[81] AnnotationDbi_1.48.0 cluster_2.1.0 ellipsis_0.3.1
Thank you for your time.