Question

Choosing DESeq2 design for mice experiment

Emilie • 0

Hello,

I have previously been using DESeq2 as part of HOMER, but I think I need more control over the analysis of this experiment, and would love some input on how to set up the design for DESeq2 (plus I would learn!).

Experimental setup: paired-end (PE) RNA-seq of mice. I have wildtype (WT) and transgenic (TG) mice. Half of my TG mice develop periods of behavioral abnormalities. At each of four time points, 1 WT mouse, 1 "normal TG" (TGN) mouse and 1 "disease TG" (TGD) mouse were sampled. This is summarized below:

	SampleName	Geno	State	Mate	Run	Batch	Sum
C41BMACXX_PR0172_13A06_H1_L007	WT1	WT	N	A	C41BMACXX_PR0172_13A06_H1_L007	A	WTN
C41BMACXX_PR0172_17A04_H1_L008	TGN1	TG	N	B	C41BMACXX_PR0172_17A04_H1_L008	A	TGN
C41DJACXX_PR0172_21A06_H1_L001	TGD1	TG	D	C	C41DJACXX_PR0172_21A06_H1_L001	A	TGD
C41DJACXX_PR0172_25A04_H1_L002	WT2	WT	N	D	C41DJACXX_PR0172_25A04_H1_L002	B	WTN
C41DJACXX_PR0172_29A06_H1_L003	TGN2	TG	N	E	C41DJACXX_PR0172_29A06_H1_L003	B	TGN
C41DJACXX_PR0172_33A04_H1_L004	TGD2	TG	D	F	C41DJACXX_PR0172_33A04_H1_L004	B	TGD
C41DJACXX_PR0172_37A06_H1_L005	WT3	WT	N	G	C41DJACXX_PR0172_37A06_H1_L005	C	WTN
C41DJACXX_PR0172_41A04_H1_L006	TGN3	TG	N	D	C41DJACXX_PR0172_41A04_H1_L006	C	TGN
C41DJACXX_PR0172_45A06_H1_L007	TGD3	TG	D	E	C41DJACXX_PR0172_45A06_H1_L007	C	TGD
C41DJACXX_PR0172_49A04_H1_L008	WT4	WT	N	A	C41DJACXX_PR0172_49A04_H1_L008	D	WTN
C41JJACXX_PR0172_53A06_H1_L006	TGN4	TG	N	D	C41JJACXX_PR0172_53A06_H1_L006	D	TGN
C41JJACXX_PR0172_57A04_H1_L007	TGD4	TG	D	H	C41JJACXX_PR0172_57A04_H1_L007	D	TGD

Where "SampleName" is the unique name of each mouse, "Geno" is the genotype, "State" is normal or disease at the time of sampling, "Mate" is which litter the animal belongs to, "Run" is the file name, "Batch" is the sampling batch, and "Sum" is the combination of "Geno" and "State".

Now, I want to find genes that are differentially expressed in TG compared to WT mice, as well as in D compared to N mice. I would like to take the different sampling batches into account. After reading around on forums and in the DESeq2 manual, I am sure I am not understanding how to properly decide on a design...

FYI, looking at the PCA plot and heatmap, there is no good separation of my samples based on anything. From the PCA plot I see that "Batch A" clusters somewhat far from the other samples, and from the heatmap I see that "Batch D" clusters together while everything else is a hot mess. However, I am expecting only small differences between my conditions.

My first thought was to include an interaction effect between Geno and State, but that gave me an error (because all D animals are TG, so there are no WT samples in state D?)

> library(DESeq2)
> dds = DESeqDataSet(se, design = ~ Batch + Geno:State)
Error in checkFullRank(modelMatrix) :
  the model matrix is not full rank, so the model cannot be fit as specified.
  Levels or combinations of levels without any samples have resulted in
  column(s) of zeros in the model matrix.

  Please read the vignette section 'Model matrix not full rank':

  vignette('DESeq2')
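
The rank problem can be reproduced without DESeq2. The following is a minimal base-R sketch, assuming the sample table above (four batches, each with one WT-N, one TG-N and one TG-D mouse); `check_design` is a hypothetical helper name, not a DESeq2 function. It compares the number of design-matrix columns with the matrix rank for three candidate formulas.

```r
# Sample table rebuilt from the post: 4 batches x (WT-N, TG-N, TG-D).
coldata <- data.frame(
  Batch = factor(rep(c("A", "B", "C", "D"), each = 3)),
  Geno  = factor(rep(c("WT", "TG", "TG"), times = 4), levels = c("WT", "TG")),
  State = factor(rep(c("N", "N", "D"), times = 4), levels = c("N", "D"))
)
coldata$Sum <- factor(paste0(coldata$Geno, coldata$State),
                      levels = c("WTN", "TGN", "TGD"))

# No WT mouse is ever sampled in state D, so that cell is empty.
print(table(coldata$Geno, coldata$State))

# Hypothetical helper: a design can only be fit if rank == number of columns.
check_design <- function(formula, data) {
  mm <- model.matrix(formula, data)
  c(columns      = ncol(mm),
    rank         = qr(mm)$rank,
    zero_columns = sum(colSums(abs(mm)) == 0))
}

r_interaction <- check_design(~ Batch + Geno:State, coldata)
r_sum         <- check_design(~ Batch + Sum, coldata)
r_main        <- check_design(~ Batch + Geno + State, coldata)

print(rbind(interaction = r_interaction, sum = r_sum, main = r_main))
```

In this sketch only the `Geno:State` formula has fewer independent columns than columns in total, because the empty WT-D cell produces an all-zero column. Both `~ Batch + Sum` and `~ Batch + Geno + State` come out full rank for this layout.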

So instead I used following:

library(DESeq2)
dds = DESeqDataSet(se,design= ~Batch+Sum)

Which has been producing OK results (as in, I am getting a few differentially expressed genes). But then I can only get results for the levels of "Sum", and cannot compare e.g. all WT to all TG using:

results(dds, contrast = c("Geno","TG","WT"), alpha = 0.05)

Which I would like to do.. (Maybe I am doing this wrong?)
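
A contrast can only name a variable that appears in the design, so with `~Batch + Sum` the comparisons have to be phrased in terms of the levels of "Sum". The sketch below shows how that might look. It is not run on the real data: simulated counts from `makeExampleDESeqDataSet()` stand in for the `se` object, and the sample order is assumed to match the table in the post.

```r
library(DESeq2)

# Assumption: simulated counts replace the real `se` object (12 samples).
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 500, m = 12)

# Assumption: samples are ordered batch by batch as WT-N, TG-N, TG-D.
dds$Batch <- factor(rep(c("A", "B", "C", "D"), each = 3))
dds$Sum   <- factor(rep(c("WTN", "TGN", "TGD"), times = 4),
                    levels = c("WTN", "TGN", "TGD"))

design(dds) <- ~ Batch + Sum
dds <- DESeq(dds)

# Genotype effect among normal animals: TG-N vs WT-N.
res_geno <- results(dds, contrast = c("Sum", "TGN", "WTN"), alpha = 0.05)

# State effect among transgenic animals: TG-D vs TG-N.
res_state <- results(dds, contrast = c("Sum", "TGD", "TGN"), alpha = 0.05)

summary(res_geno)
summary(res_state)
```

Because the data are simulated without any batch or group effect, the numbers themselves are meaningless; only the way the two contrasts are specified is the point.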

I hope this has been somewhat clear. I hope for any help to guide a newbie in the right direction :)

Links to similar questions that I may have missed are also highly appreciated! I am trying to learn how to do this properly, but am finding it a bit challenging...

> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=ja_JP.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=ja_JP.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=ja_JP.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=ja_JP.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
 [1] org.Mm.eg.db_3.5.0         genefilter_1.60.0         
 [3] ggbeeswarm_0.6.0           PoiClaClu_1.0.2           
 [5] RColorBrewer_1.1-2         pheatmap_1.0.8            
 [7] hexbin_1.27.1              ggplot2_2.2.1             
 [9] dplyr_0.7.4                DESeq2_1.18.1             
[11] BiocParallel_1.12.0        BiocInstaller_1.28.0      
[13] GenomicFeatures_1.30.0     AnnotationDbi_1.40.0      
[15] GenomicAlignments_1.14.1   Rsamtools_1.30.0          
[17] Biostrings_2.46.0          XVector_0.18.0            
[19] SummarizedExperiment_1.8.0 DelayedArray_0.4.1        
[21] matrixStats_0.52.2         Biobase_2.38.0            
[23] GenomicRanges_1.30.0       GenomeInfoDb_1.14.0       
[25] IRanges_2.12.0             S4Vectors_0.16.0          
[27] BiocGenerics_0.24.0       

loaded via a namespace (and not attached):
 [1] RMySQL_0.10.13          bit64_0.9-7             splines_3.4.2          
 [4] Formula_1.2-2           assertthat_0.2.0        latticeExtra_0.6-28    
 [7] blob_1.1.0              vipor_0.4.5             GenomeInfoDbData_0.99.1
[10] progress_1.1.2          RSQLite_2.0             backports_1.1.1        
[13] lattice_0.20-35         glue_1.2.0              digest_0.6.12          
[16] checkmate_1.8.5         colorspace_1.3-2        htmltools_0.3.6        
[19] Matrix_1.2-11           plyr_1.8.4              XML_3.98-1.9           
[22] pkgconfig_2.0.1         biomaRt_2.34.0          zlibbioc_1.24.0        
[25] xtable_1.8-2            scales_0.5.0            tibble_1.3.4           
[28] htmlTable_1.9           annotate_1.56.1         nnet_7.3-12            
[31] lazyeval_0.2.1          survival_2.41-3         magrittr_1.5           
[34] memoise_1.1.0           foreign_0.8-69          beeswarm_0.2.3         
[37] tools_3.4.2             data.table_1.10.4-3     prettyunits_1.0.2      
[40] stringr_1.2.0           munsell_0.4.3           locfit_1.5-9.1         
[43] cluster_2.0.6           bindrcpp_0.2            compiler_3.4.2         
[46] rlang_0.1.4             grid_3.4.2              RCurl_1.95-4.8         
[49] htmlwidgets_0.9         labeling_0.3            bitops_1.0-6           
[52] base64enc_0.1-3         gtable_0.2.0            DBI_0.7                
[55] reshape2_1.4.2          R6_2.2.2                gridExtra_2.3          
[58] knitr_1.17              rtracklayer_1.38.0      bit_1.1-12             
[61] bindr_0.1               Hmisc_4.0-3             stringi_1.1.6          
[64] Rcpp_0.12.14            geneplotter_1.56.0      rpart_4.1-11           
[67] acepack_1.4.1          
>

deseq2 mouse rnaseq • 1.1k views

ADD COMMENT • link 6.4 years ago Emilie • 0

Thanks for your input.

Just to make sure I am understanding you correctly .. :) So you are saying I should do different designs for each of my questions? So doing one for Sum, one for Geno, and one for State?

ADD REPLY • link 6.4 years ago Emilie • 0

You may want to discuss with a statistical collaborator on these modeling decisions.

It sounds like you think there is a separate effect for each state, so for this decision you would use ~Sum.

ADD REPLY • link 6.4 years ago Michael Love 41k

score 0 · Answer 1 · 2017-12-01

With the linear model, you have to choose either the same effect (say TG vs WT) for both states, or a different effect for each state, but you can't really model different effects and then extract the common effect. These models have different assumptions.

The closest you could get is to average the effect of TG vs WT across both states, but I don't like this approach, as the average could represent something that is not the case in either state (e.g. imagine you have a positive LFC in one state and a negative LFC of equal size in the other: the average gives you 0, which represents neither state). The average is therefore misleading in my opinion. If you think there will be a different effect per state, I'd just go with those two results tables.
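
To make the cancellation concrete, here is a small numeric sketch in base R. The group names and log2-scale group means are made up for illustration and do not come from this experiment; it assumes a generic layout in which both genotypes are observed in both states.

```r
# Hypothetical mean log2 expression of one gene in four groups.
group_means <- c(WT_state1 = 5, TG_state1 = 6, WT_state2 = 5, TG_state2 = 4)

# Contrast vectors over the four group means.
contrast_state1 <- c(-1, 1,  0, 0)   # TG vs WT within state 1
contrast_state2 <- c( 0, 0, -1, 1)   # TG vs WT within state 2
contrast_avg    <- (contrast_state1 + contrast_state2) / 2

lfc_state1 <- sum(contrast_state1 * group_means)  # 1
lfc_state2 <- sum(contrast_state2 * group_means)  # -1
lfc_avg    <- sum(contrast_avg * group_means)     # 0

print(c(state1 = lfc_state1, state2 = lfc_state2, average = lfc_avg))
```

The averaged contrast reports no change even though the gene moves by one log2 unit in each state, just in opposite directions.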