Question

Extracting DESeq2 results when using a complex, user-specified matrix and post-hoc significance testing

0

Entering edit mode

thadryan • 0

@thadryan-23400

Last seen 2.6 years ago

New England

Hello!

I am hoping to check that A) the complex model I've had to fit to this dataset is correct and that B) my post-hoc double checking of results significance is reasonable.

I'm working on an analysis where we're interested in understanding the differences between two disease subtypes (GroupA and GroupB) at two time points (Visit1 and Visit2) while controlling for the individual. We have an unbalanced number of observations between groups (nGroupA = 6, nGroupB = 8). The corresponding coldata matrix looks like this:


suppressMessages(library(tidyverse))
suppressMessages(library(DESeq2))

set.seed(8675309)

samples <- tibble(
  ID = rep(seq(1,14), each = 2),
  Group = c(rep("GroupA", 12), rep("GroupB", 16)),
  Visit = c(rep(c("Visit1", "Visit2"), 14))
) %>% data.frame

samples

   ID  Group  Visit
1   1 GroupA Visit1
2   1 GroupA Visit2
3   2 GroupA Visit1
4   2 GroupA Visit2
5   3 GroupA Visit1
6   3 GroupA Visit2
7   4 GroupA Visit1
8   4 GroupA Visit2
9   5 GroupA Visit1
10  5 GroupA Visit2
11  6 GroupA Visit1
12  6 GroupA Visit2
13  7 GroupB Visit1
14  7 GroupB Visit2
15  8 GroupB Visit1
16  8 GroupB Visit2
17  9 GroupB Visit1
18  9 GroupB Visit2
19 10 GroupB Visit1
20 10 GroupB Visit2
21 11 GroupB Visit1
22 11 GroupB Visit2
23 12 GroupB Visit1
24 12 GroupB Visit2
25 13 GroupB Visit1
26 13 GroupB Visit2
27 14 GroupB Visit1
28 14 GroupB Visit2

A simple fit of this model that controls for individuals and interactions (~ ID + Group + Visit + Group:Visit) produces the "not full rank error" so I followed the docs to create a model where individuals further IDed by their Group membership in sequence:


# GroupA ID 1-6, GroupB ID 1-8, two visits each
samples$GroupID <- as.factor(c(rep(seq(1, 6), each = 2), rep(seq(1, 8), each = 2)))

# add main and interaction effects for desired conditions 
customMatrix <- model.matrix(~Group + Visit + Group:GroupID + Group:Visit, samples) %>%
  data.frame

As noted in the docs, this will produce some columns of pure 0s and create the "not full rank" error again anyway give there is no observation 7 or 8 in GroupA, so we remove them.

# filter out the empty columns
customMatrix <- customMatrix %>%
  data.frame %>%
  dplyr::select(-c("GroupGroupA.GroupID7","GroupGroupA.GroupID8")) %>%
  as.matrix

We now have a usable matrix.

customMatrix # too big for forum

We can now run DESeq2:


### optional: generate a fake counts df with fake gene names so this can be run top-bottom
counts <- matrix(nrow = 0, ncol = 28)

# make ten fake genes with count data
for(i in 1:10) {

  # make them appear different so DESeq2 will fit
  x <- rbinom(14, sample(1000:100000, 1), runif(1, 0, 1))
  y <- rbinom(14, sample(1000:100, 1), runif(1, 0, 1))

  # create a row
  counts <- rbind(counts, c(x,y))

  # fake a gene name
  rownames(counts)[i] <- paste(sample(LETTERS, 10, TRUE), collapse = "")
}

counts <- data.frame(counts)

colnames(counts) <- paste0("Subject", samples$ID)

# run DESeq2
dds <- DESeq(
  DESeqDataSetFromMatrix(
    countData = counts,
    colData = samples,
    design = customMatrix
  )
)

Because we're interested in genes that are significant after controlling for all these factors I assumed we should submit a contrast vector that's full of 1s to get our hits from results():


# consider whole model model for significance testing
res <- results(dds, contrast = rep(1, length(resultsNames(dds))))
# or:
# res <- results(dds, contrast = list(resultsNames(dds)))

# get significant hits
sigRes <- res %>%
  data.frame %>% 
  filter(padj < 0.05)

I inspected the plots to sanity check my results. Some of them looked somewhat unconvincing to me by eye, so I applied some post-hoc t-tests to ensure I was getting significant hits. I extracted the counts using plotCounts(), and t.tested in terms of Group or Visit (I did both out of curiosity).


### for all those things DESeq2 is calling significant, do a t-test and a sanity plot
# look up each significant gene in the res table so we can call plotCounts on it
for(i in match(rownames(sigRes), rownames(res))) {

  # gene name
  gene <- rownames(res)[i]

  # get the DESeq2 normalized counts
  current <- plotCounts(dds, gene = i, intgroup = c("Group", "Visit"), returnData = TRUE)

  # do a t-test for Group or Visit
  tTest <- t.test(count ~ Group, data = current)

  # save or plot the ones that pass
  if(tTest$p.value < 0.05) {
    ### save a list of hits that passed significance via DESeq2 and post-hoc test
    print("this gene passed round two")
  } else {
    ### pass on this gene to be safe
    print("this gene did not pass round two")
  }
}

My questions are :

1) Is this the correct way to extract results for a complex, user-defined matrix if we want to control for all the conditions mentioned? 2) Is extra filtering by t-test for features of interest in this manner appropriate?

Thank you!


sessionInfo( )

R version 3.6.2 (2019-12-12)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] DESeq2_1.26.0               SummarizedExperiment_1.16.1 DelayedArray_0.12.3         BiocParallel_1.20.1        
 [5] matrixStats_0.57.0          Biobase_2.46.0              GenomicRanges_1.38.0        GenomeInfoDb_1.22.1        
 [9] IRanges_2.20.2              S4Vectors_0.24.4            BiocGenerics_0.32.0         forcats_0.5.0              
[13] stringr_1.4.0               dplyr_1.0.2                 purrr_0.3.4                 readr_1.4.0                
[17] tidyr_1.1.2                 tibble_3.0.4                ggplot2_3.3.2               tidyverse_1.3.0            

loaded via a namespace (and not attached):
 [1] bitops_1.0-6           fs_1.5.0               bit64_4.0.5            lubridate_1.7.9        RColorBrewer_1.1-2    
 [6] httr_1.4.2             tools_3.6.2            backports_1.1.10       R6_2.4.1               rpart_4.1-15          
[11] Hmisc_4.4-1            DBI_1.1.0              colorspace_1.4-1       nnet_7.3-14            withr_2.3.0           
[16] tidyselect_1.1.0       gridExtra_2.3          bit_4.0.4              compiler_3.6.2         cli_2.1.0             
[21] rvest_0.3.6            htmlTable_2.1.0        xml2_1.3.2             scales_1.1.1           checkmate_2.0.0       
[26] genefilter_1.68.0      digest_0.6.27          foreign_0.8-76         XVector_0.26.0         base64enc_0.1-3       
[31] jpeg_0.1-8.1           pkgconfig_2.0.3        htmltools_0.5.0        dbplyr_1.4.4           htmlwidgets_1.5.2     
[36] rlang_0.4.8            readxl_1.3.1           RSQLite_2.2.1          rstudioapi_0.11        generics_0.0.2        
[41] jsonlite_1.7.1         RCurl_1.98-1.2         magrittr_1.5           GenomeInfoDbData_1.2.2 Formula_1.2-3         
[46] Matrix_1.2-18          Rcpp_1.0.5             munsell_0.5.0          fansi_0.4.1            lifecycle_0.2.0       
[51] stringi_1.5.3          zlibbioc_1.32.0        grid_3.6.2             blob_1.2.1             crayon_1.3.4          
[56] lattice_0.20-41        haven_2.3.1            splines_3.6.2          annotate_1.64.0        hms_0.5.3             
[61] locfit_1.5-9.4         knitr_1.30             pillar_1.4.6           geneplotter_1.64.0     XML_3.99-0.3          
[66] reprex_0.3.0           glue_1.4.2             latticeExtra_0.6-29    data.table_1.12.8      modelr_0.1.8          
[71] png_0.1-7              vctrs_0.3.4            cellranger_1.1.0       gtable_0.3.0           assertthat_0.2.1      
[76] xfun_0.18              xtable_1.8-4           broom_0.7.1            survival_3.2-7         memoise_1.1.0         
[81] AnnotationDbi_1.48.0   cluster_2.1.0          ellipsis_0.3.1

DESeq2 • 651 views

ADD COMMENT • link 3.4 years ago thadryan • 0

score 0 · Answer 1 · 2020-11-28

I didn't review the statistical design -- I don't have time unfortunately to review user's statistical analysis but instead recommend that they work with local statisticians if they want statistical review of their analysis plan.

I wouldn't recommend performing t-test on top of the NB GLM tests here: we know that DESeq2 has much higher power than performing row-wise t-tests, due to its sharing of information across genes. So it's no longer running DESeq2 but a much less powered combination of tests. I'd instead recommend using either the FDR threshold or lfcThreshold if you want to have stricter criterion.