Hello,
I have previously been using DESeq2 as part of HOMER, but I think I need more control over the analysis of this experiment, and would love some input on how to set up the design for DESeq2 (plus I would learn!).
Experimental setup: RNA-seq (PE) of mice. I have wildtype (WT) and transgenic (TG) mice. Half of my TG mice develop periods of behavioral abnormalities. At four different time points 1 WT mouse, 1 "normal TG" (TGN) and 1 "disease TG" (TGD) has been sampled. This is summarized below:
SampleName | Geno | State | Mate | Run | Batch | Sum | |
C41BMACXX_PR0172_13A06_H1_L007 | WT1 | WT | N | A | C41BMACXX_PR0172_13A06_H1_L007 | A | WTN |
C41BMACXX_PR0172_17A04_H1_L008 | TGN1 | TG | N | B | C41BMACXX_PR0172_17A04_H1_L008 | A | TGN |
C41DJACXX_PR0172_21A06_H1_L001 | TGD1 | TG | D | C | C41DJACXX_PR0172_21A06_H1_L001 | A | TGD |
C41DJACXX_PR0172_25A04_H1_L002 | WT2 | WT | N | D | C41DJACXX_PR0172_25A04_H1_L002 | B | WTN |
C41DJACXX_PR0172_29A06_H1_L003 | TGN2 | TG | N | E | C41DJACXX_PR0172_29A06_H1_L003 | B | TGN |
C41DJACXX_PR0172_33A04_H1_L004 | TGD2 | TG | D | F | C41DJACXX_PR0172_33A04_H1_L004 | B | TGD |
C41DJACXX_PR0172_37A06_H1_L005 | WT3 | WT | N | G | C41DJACXX_PR0172_37A06_H1_L005 | C | WTN |
C41DJACXX_PR0172_41A04_H1_L006 | TGN3 | TG | N | D | C41DJACXX_PR0172_41A04_H1_L006 | C | TGN |
C41DJACXX_PR0172_45A06_H1_L007 | TGD3 | TG | D | E | C41DJACXX_PR0172_45A06_H1_L007 | C | TGD |
C41DJACXX_PR0172_49A04_H1_L008 | WT4 | WT | N | A | C41DJACXX_PR0172_49A04_H1_L008 | D | WTN |
C41JJACXX_PR0172_53A06_H1_L006 | TGN4 | TG | N | D | C41JJACXX_PR0172_53A06_H1_L006 | D | TGN |
C41JJACXX_PR0172_57A04_H1_L007 | TGD4 | TG | D | H | C41JJACXX_PR0172_57A04_H1_L007 | D | TGD |
Where "sampleName" is the uniqe name of each mouse, "Geno" is the genotype, "State" is normal or disease at the time of sampling, "Mate" is which litter the animal belongs to, "Run" is file name, "Batch" is sampling batch, and "Sum" is a collective of "Geno" and "State".
Now, I want to find genes that are differentially expressed in TG compared to WT as well as D compared to N mice. I would like to take into consideration the different sampling batches. After reading around at forums and the manual for DESeq2 I am sure I am not understanding how to properly decide on a design...
FYI, looking at PCA plot and heatmap there is no good separation of my samples based on anything. From the PCA plot I see that "Batch A" clusters somewhat far from the other samples and from the heatmap I see that "Batch D" cluster together while everything else is a hot mess. However, I am expecting only small differences between my conditions.
My first thought was to include an interaction effect between Geno and State, but that gave me an error (because all N animals are TG?)
>library(DESeq2) > dds = DESeqDataSet(se,design= ~Batch + Geno:State) Error in checkFullRank(modelMatrix) : the model matrix is not full rank, so the model cannot be fit as specified. Levels or combinations of levels without any samples have resulted in column(s) of zeros in the model matrix. Please read the vignette section 'Model matrix not full rank': vignette('DESeq2')
So instead I used following:
library(DESeq2) dds = DESeqDataSet(se,design= ~Batch+Sum)
Which has been producing OK results (as is I am getting a few differentially expressed genes). But then I can only get info on the "Sum" and not compare e.g. all WT to all TG using:
results(dds, contrast = c("Geno","TG","WT"), alpha = 0.05)
Which I would like to do.. (Maybe I am doing this wrong?)
I hope this has been somewhat clear. I hope for any help to guide a newbie in the right direction :)
Links to similar questions that I may have missed are also highly appreciated! Trying to learn how properly do this, but am finding it a bit challenging...
> sessionInfo() R version 3.4.2 (2017-09-28) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 16.04.3 LTS Matrix products: default BLAS: /usr/lib/libblas/libblas.so.3.6.0 LAPACK: /usr/lib/lapack/liblapack.so.3.6.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=ja_JP.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=ja_JP.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=ja_JP.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=ja_JP.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets [8] methods base other attached packages: [1] org.Mm.eg.db_3.5.0 genefilter_1.60.0 [3] ggbeeswarm_0.6.0 PoiClaClu_1.0.2 [5] RColorBrewer_1.1-2 pheatmap_1.0.8 [7] hexbin_1.27.1 ggplot2_2.2.1 [9] dplyr_0.7.4 DESeq2_1.18.1 [11] BiocParallel_1.12.0 BiocInstaller_1.28.0 [13] GenomicFeatures_1.30.0 AnnotationDbi_1.40.0 [15] GenomicAlignments_1.14.1 Rsamtools_1.30.0 [17] Biostrings_2.46.0 XVector_0.18.0 [19] SummarizedExperiment_1.8.0 DelayedArray_0.4.1 [21] matrixStats_0.52.2 Biobase_2.38.0 [23] GenomicRanges_1.30.0 GenomeInfoDb_1.14.0 [25] IRanges_2.12.0 S4Vectors_0.16.0 [27] BiocGenerics_0.24.0 loaded via a namespace (and not attached): [1] RMySQL_0.10.13 bit64_0.9-7 splines_3.4.2 [4] Formula_1.2-2 assertthat_0.2.0 latticeExtra_0.6-28 [7] blob_1.1.0 vipor_0.4.5 GenomeInfoDbData_0.99.1 [10] progress_1.1.2 RSQLite_2.0 backports_1.1.1 [13] lattice_0.20-35 glue_1.2.0 digest_0.6.12 [16] checkmate_1.8.5 colorspace_1.3-2 htmltools_0.3.6 [19] Matrix_1.2-11 plyr_1.8.4 XML_3.98-1.9 [22] pkgconfig_2.0.1 biomaRt_2.34.0 zlibbioc_1.24.0 [25] xtable_1.8-2 scales_0.5.0 tibble_1.3.4 [28] htmlTable_1.9 annotate_1.56.1 nnet_7.3-12 [31] lazyeval_0.2.1 survival_2.41-3 magrittr_1.5 [34] memoise_1.1.0 foreign_0.8-69 beeswarm_0.2.3 [37] tools_3.4.2 data.table_1.10.4-3 prettyunits_1.0.2 [40] stringr_1.2.0 munsell_0.4.3 locfit_1.5-9.1 [43] cluster_2.0.6 bindrcpp_0.2 compiler_3.4.2 [46] rlang_0.1.4 grid_3.4.2 RCurl_1.95-4.8 [49] htmlwidgets_0.9 labeling_0.3 bitops_1.0-6 [52] base64enc_0.1-3 gtable_0.2.0 DBI_0.7 [55] reshape2_1.4.2 R6_2.2.2 gridExtra_2.3 [58] knitr_1.17 rtracklayer_1.38.0 bit_1.1-12 [61] bindr_0.1 Hmisc_4.0-3 stringi_1.1.6 [64] Rcpp_0.12.14 geneplotter_1.56.0 rpart_4.1-11 [67] acepack_1.4.1 >
Thanks for your input.
Just to make sure I am understanding you correctly .. :) So you are saying I should do different designs for each of my questions? So doing one for Sum, one for Geno, and one for State?
You may want to discuss with a statistical collaborator on these modeling decisions.
It sounds like you think there is a separate effect for each state, so for this decision you would use ~Sum.