Choosing DESeq2 design for mice experiment
1
0
Entering edit mode
Emilie • 0
@emilie-14517
Last seen 5.0 years ago
Japan

Hello,

I have previously been using DESeq2 as part of HOMER, but I think I need more control over the analysis of this experiment, and would love some input on how to set up the design for DESeq2 (plus I would learn!).

Experimental setup: RNA-seq (PE) of mice. I have wildtype (WT) and transgenic (TG) mice. Half of my TG mice develop periods of behavioral abnormalities. At four different time points 1 WT mouse, 1 "normal TG" (TGN) and 1 "disease TG" (TGD) has been sampled. This is summarized below:


  SampleName Geno State Mate Run Batch Sum
C41BMACXX_PR0172_13A06_H1_L007 WT1 WT N A C41BMACXX_PR0172_13A06_H1_L007 A WTN
C41BMACXX_PR0172_17A04_H1_L008 TGN1 TG N B C41BMACXX_PR0172_17A04_H1_L008 A TGN
C41DJACXX_PR0172_21A06_H1_L001 TGD1 TG D C C41DJACXX_PR0172_21A06_H1_L001 A TGD
C41DJACXX_PR0172_25A04_H1_L002 WT2 WT N D C41DJACXX_PR0172_25A04_H1_L002 B WTN
C41DJACXX_PR0172_29A06_H1_L003 TGN2 TG N E C41DJACXX_PR0172_29A06_H1_L003 B TGN
C41DJACXX_PR0172_33A04_H1_L004 TGD2 TG D F C41DJACXX_PR0172_33A04_H1_L004 B TGD
C41DJACXX_PR0172_37A06_H1_L005 WT3 WT N G C41DJACXX_PR0172_37A06_H1_L005 C WTN
C41DJACXX_PR0172_41A04_H1_L006 TGN3 TG N D C41DJACXX_PR0172_41A04_H1_L006 C TGN
C41DJACXX_PR0172_45A06_H1_L007 TGD3 TG D E C41DJACXX_PR0172_45A06_H1_L007 C TGD
C41DJACXX_PR0172_49A04_H1_L008 WT4 WT N A C41DJACXX_PR0172_49A04_H1_L008 D WTN
C41JJACXX_PR0172_53A06_H1_L006 TGN4 TG N D C41JJACXX_PR0172_53A06_H1_L006 D TGN
C41JJACXX_PR0172_57A04_H1_L007 TGD4 TG D H C41JJACXX_PR0172_57A04_H1_L007 D TGD

Where "sampleName" is the uniqe name of each mouse, "Geno" is the genotype, "State" is normal or disease at the time of sampling, "Mate" is which litter the animal belongs to, "Run" is file name, "Batch" is sampling batch, and "Sum" is a collective of "Geno" and "State".

Now, I want to find genes that are differentially expressed in TG compared to WT as well as D compared to N mice. I would like to take into consideration the different sampling batches. After reading around at forums and the manual for DESeq2 I am sure I am not understanding how to properly decide on a design...

FYI, looking at PCA plot and heatmap there is no good separation of my samples based on anything. From the PCA plot I see that "Batch A" clusters somewhat far from the other samples and from the heatmap I see that "Batch D" cluster together while everything else is a hot mess. However, I am expecting only small differences between my conditions.

My first thought was to include an interaction effect between Geno and State, but that gave me an error (because all N animals are TG?)

>library(DESeq2)

> dds = DESeqDataSet(se,design= ~Batch + Geno:State)
Error in checkFullRank(modelMatrix) :
  the model matrix is not full rank, so the model cannot be fit as specified.
  Levels or combinations of levels without any samples have resulted in
  column(s) of zeros in the model matrix.

  Please read the vignette section 'Model matrix not full rank':

  vignette('DESeq2')

So instead I used following:

library(DESeq2)
dds = DESeqDataSet(se,design= ~Batch+Sum)

Which has been producing OK results (as is I am getting a few differentially expressed genes). But then I can only get info on the "Sum"  and not compare e.g. all WT to all TG using:

results(dds, contrast = c("Geno","TG","WT"), alpha = 0.05)

Which I would like to do.. (Maybe I am doing this wrong?)

 

I hope this has been somewhat clear. I hope for any help to guide a newbie in the right direction :)

Links to similar questions that I may have missed are also highly appreciated! Trying to learn how properly do this, but am finding it a bit challenging...

 

> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=ja_JP.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=ja_JP.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=ja_JP.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=ja_JP.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
 [1] org.Mm.eg.db_3.5.0         genefilter_1.60.0         
 [3] ggbeeswarm_0.6.0           PoiClaClu_1.0.2           
 [5] RColorBrewer_1.1-2         pheatmap_1.0.8            
 [7] hexbin_1.27.1              ggplot2_2.2.1             
 [9] dplyr_0.7.4                DESeq2_1.18.1             
[11] BiocParallel_1.12.0        BiocInstaller_1.28.0      
[13] GenomicFeatures_1.30.0     AnnotationDbi_1.40.0      
[15] GenomicAlignments_1.14.1   Rsamtools_1.30.0          
[17] Biostrings_2.46.0          XVector_0.18.0            
[19] SummarizedExperiment_1.8.0 DelayedArray_0.4.1        
[21] matrixStats_0.52.2         Biobase_2.38.0            
[23] GenomicRanges_1.30.0       GenomeInfoDb_1.14.0       
[25] IRanges_2.12.0             S4Vectors_0.16.0          
[27] BiocGenerics_0.24.0       

loaded via a namespace (and not attached):
 [1] RMySQL_0.10.13          bit64_0.9-7             splines_3.4.2          
 [4] Formula_1.2-2           assertthat_0.2.0        latticeExtra_0.6-28    
 [7] blob_1.1.0              vipor_0.4.5             GenomeInfoDbData_0.99.1
[10] progress_1.1.2          RSQLite_2.0             backports_1.1.1        
[13] lattice_0.20-35         glue_1.2.0              digest_0.6.12          
[16] checkmate_1.8.5         colorspace_1.3-2        htmltools_0.3.6        
[19] Matrix_1.2-11           plyr_1.8.4              XML_3.98-1.9           
[22] pkgconfig_2.0.1         biomaRt_2.34.0          zlibbioc_1.24.0        
[25] xtable_1.8-2            scales_0.5.0            tibble_1.3.4           
[28] htmlTable_1.9           annotate_1.56.1         nnet_7.3-12            
[31] lazyeval_0.2.1          survival_2.41-3         magrittr_1.5           
[34] memoise_1.1.0           foreign_0.8-69          beeswarm_0.2.3         
[37] tools_3.4.2             data.table_1.10.4-3     prettyunits_1.0.2      
[40] stringr_1.2.0           munsell_0.4.3           locfit_1.5-9.1         
[43] cluster_2.0.6           bindrcpp_0.2            compiler_3.4.2         
[46] rlang_0.1.4             grid_3.4.2              RCurl_1.95-4.8         
[49] htmlwidgets_0.9         labeling_0.3            bitops_1.0-6           
[52] base64enc_0.1-3         gtable_0.2.0            DBI_0.7                
[55] reshape2_1.4.2          R6_2.2.2                gridExtra_2.3          
[58] knitr_1.17              rtracklayer_1.38.0      bit_1.1-12             
[61] bindr_0.1               Hmisc_4.0-3             stringi_1.1.6          
[64] Rcpp_0.12.14            geneplotter_1.56.0      rpart_4.1-11           
[67] acepack_1.4.1          
>
deseq2 mouse rnaseq • 1.3k views
ADD COMMENT
0
Entering edit mode

Thanks for your input.

Just to make sure I am understanding you correctly .. :) So you are saying I should do different designs for each of my questions? So doing one for Sum, one for Geno, and one for State?

ADD REPLY
0
Entering edit mode

You may want to discuss with a statistical collaborator on these modeling decisions.

It sounds like you think there is a separate effect for each state, so for this decision you would use ~Sum.

ADD REPLY
0
Entering edit mode
@mikelove
Last seen 1 day ago
United States

With the linear model, you have to choose either the same effect (say TG vs WT) for both states, or different effects for each state, but you can't really model different and then extract the common effect. These models have different assumptions.

The closest you could do is to average the effect of TG vs WT across both states, but I don't like this approach, as the average could represent something that is not the case in either state (e.g. imagine if you have positive LFC in one state and negative but equal in size in the other state, the average gives you 0, which represents neither state). The average is therefore misleading in my opinion. If you think they will have different effect per state, I'd just go with those two results tables.

ADD COMMENT

Login before adding your answer.

Traffic: 771 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6