Hi,
I am wondering what the correct model formula and analysis approach is with DESeq2 for my multifactor paired design. I have tried to look at previous answers, e.g. DESEq2 comparison with two conditions and paired experiments and DESeq2 multiple factors nested design but I keep getting warnings from DESeq2, so I must be interpreting them wrong, or my design is slightly different.
What I have is 2 different tissue cell cultures, 5 replicates, that have been treated with a chemical. I have RNAseq count data for both the untreated and treated samples, so each treated vs. untreated is a pair:
sample tissue culture treatment
1 tissueX 1 unstimulated
2 tissueX 1 stimulated
3 tissueX 2 unstimulated
4 tissueX 2 stimulated
5 tissueX 3 unstimulated
6 tissueX 3 stimulated
7 tissueX 4 unstimulated
8 tissueX 4 stimulated
9 tissueX 5 unstimulated
10 tissueX 5 stimulated
11 tissueY 6 unstimulated
12 tissueY 6 stimulated
13 tissueY 7 unstimulated
14 tissueY 7 stimulated
15 tissueY 8 unstimulated
16 tissueY 8 stimulated
17 tissueY 9 unstimulated
18 tissueY 9 stimulated
19 tissueY 10 unstimulated
20 tissueY 10 stimulated
The genes that I'm interested in identifying, are those where the stimulation effect differs most between the two tissues. If not taking the paired design into account, I can use the model: ~tissue+treatment+tissue:treatment that is accepted without errors by DESeq2. However, I don't get very significant interactions and I'd like to take advantage of the paired design.
If I just analyze tissueX unstimulated versus stimulated, I can add the "culture" factor to the model and it improves the pvalues considerably: ~subject+treatment
Simply adding the subject term to the model formula as for the simple one tissue paired case, like this: ~subject + tissue+treatment+tissue:treatment gives me an error.
Trying to interpolate a previous answer to my case: ~ subject + treatment + tissue:treatment gives me the same error: "the model matrix is not full rank, so the model cannot be fit as specified."
How do I incorporate the paired design into the model formula and extract the correct significance values?
Thanks in advance,
Hanni
> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-suse-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] ggplot2_1.0.0 DESeq2_1.6.1
[3] RcppArmadillo_0.4.450.1.0 Rcpp_0.11.3
[5] GenomicRanges_1.18.1 GenomeInfoDb_1.2.2
[7] IRanges_2.0.0 S4Vectors_0.4.0
[9] Biobase_2.26.0 BiocGenerics_0.12.0
loaded via a namespace (and not attached):
[1] acepack_1.3-3.3 annotate_1.44.0 AnnotationDbi_1.28.0
[4] base64enc_0.1-2 BatchJobs_1.4 BBmisc_1.7
[7] BiocParallel_1.0.0 brew_1.0-6 checkmate_1.5.0
[10] cluster_1.15.3 codetools_0.2-9 colorspace_1.2-4
[13] DBI_0.3.1 digest_0.6.4 fail_1.2
[16] foreach_1.4.2 foreign_0.8-61 Formula_1.1-2
[19] genefilter_1.48.1 geneplotter_1.44.0 grid_3.1.0
[22] gtable_0.1.2 Hmisc_3.14-5 iterators_1.0.7
[25] labeling_0.3 lattice_0.20-29 latticeExtra_0.6-26
[28] locfit_1.5-9.1 MASS_7.3-31 munsell_0.4.2
[31] nnet_7.3-8 plyr_1.8.1 proto_0.3-10
[34] RColorBrewer_1.0-5 reshape2_1.4 rpart_4.1-8
[37] RSQLite_1.0.0 scales_0.2.4 sendmailR_1.2-1
[40] splines_3.1.0 stringr_0.6.2 survival_2.37-7
[43] tools_3.1.0 XML_3.98-1.1 xtable_1.7-4
[46] XVector_0.6.0
Hi Michael,
Although this is a 5 years old post, it is still a relevant one. I have been reading all your answers regarding paired multifactor design. I hope, I understood most of it. What I am struggling to understand is if our goal is to test for genes which respond to treatment effect in different tissues then (tissue:treatment) interaction term is the one to go for, why will we need (tissue:culture.nested) interaction term?
Thanks in advance for all the help you are providing.
Nurun
The difference between those terms is subtle and I’d recommend working with a statistician that can walk you through it.
Hi Michael,
I have a very similar dataset as described here, but with an additional time point involved. So basically there is another column with timepoint at 4h and 8h and the dataset is twice as big. How would it change the design? Is it still valid if I perform the same procedure as you described here, but separately for the two time points?
Thank you very much for your help!
Best regards drgb