Hello,
So, this is yet another question about the model-matrix-is-not-full-rank-error. I have looked through many posts but have not been able to resolve this error. I have 6 samples in the dataset. The data-set is as follows:
subject_grp Disease Time-point Sample1 1 Yes B Sample2 1 Yes T Sample3 4 No B Sample4 4 No T Sample5 9 Yes B Sample6 9 Yes T
I just want to check the impact of Disease variable on count data (of 16S amplicons) keeping in the mind that the 6 samples are in paired by subject_grp variable.
I don't care about the Time-point variable but if it helps in resolving the problem, please let me know.
Anyway, when I use the design as subject_grp + Disease, I get the dreaded matrix not full rank error. I have tried a bunch of different designs including A: DESeq2 paired multifactor test but so far have not been successful. I'd greatly appreciate if some help can be provided. Thanks a lot.
Ali
> sessionInfo() R version 3.2.1 (2015-06-18) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 14.04.2 LTS locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] DESeq2_1.6.3 RcppArmadillo_0.5.200.1.0 Rcpp_0.11.6 [4] GenomicRanges_1.18.4 GenomeInfoDb_1.2.4 IRanges_2.0.1 [7] S4Vectors_0.4.0 BiocGenerics_0.12.1 plyr_1.8.3 [10] koRpus_0.05-6 secondgenomeR_0.2.64 pander_0.5.2 [13] knitr_1.10.5 ggplot2_1.0.1 phyloseq_1.10.0 loaded via a namespace (and not attached): [1] nlme_3.1-121 bitops_1.0-6 matrixStats_0.14.2 RColorBrewer_1.1-2 [5] tools_3.2.1 vegan_2.3-0 rpart_4.1-10 KernSmooth_2.23-15 [9] Hmisc_3.16-0 DBI_0.3.1 mgcv_1.8-6 colorspace_1.2-6 [13] permute_0.8-4 ade4_1.7-2 nnet_7.3-10 gridExtra_2.0.0 [17] chron_2.3-47 sendmailR_1.2-1 Biobase_2.26.0 ggdendro_0.1-15 [21] caTools_1.17.1 scales_0.2.5 checkmate_1.6.1 BatchJobs_1.6 [25] genefilter_1.48.1 stringr_1.0.0 digest_0.6.8 foreign_0.8-65 [29] XVector_0.6.0 base64enc_0.1-2 limma_3.22.4 RSQLite_1.0.0 [33] BBmisc_1.9 BiocParallel_1.0.2 gtools_3.5.0 acepack_1.3-3.3 [37] RCurl_1.95-4.7 magrittr_1.5 Formula_1.2-1 Matrix_1.2-2 [41] munsell_0.4.2 ape_3.3 proto_0.3-10 stringi_0.5-5 [45] edgeR_3.8.5 MASS_7.3-43 RJSONIO_1.3-0 zlibbioc_1.12.0 [49] gplots_2.17.0 fail_1.2 grid_3.2.1 gdata_2.17.0 [53] crayon_1.3.1 lattice_0.20-33 Biostrings_2.34.1 splines_3.2.1 [57] multtest_2.22.0 annotate_1.44.0 locfit_1.5-9.1 igraph_1.0.1 [61] geneplotter_1.44.0 reshape2_1.4.1 codetools_0.2-14 XML_3.98-1.3 [65] latticeExtra_0.6-26 biom_0.3.12 data.table_1.9.4 foreach_1.4.2 [69] testthat_0.10.0 gtable_0.1.2 xtable_1.7-4 metagenomeSeq_1.10.0 [73] survival_2.38-3 pheatmap_1.0.7 iterators_1.0.7 AnnotationDbi_1.28.1 [77] memoise_0.2.1 cluster_2.0.2 brew_1.0-6
If you wanted to compare across time point, even testing time-point differences within or across disease would be possible. That's because you would be comparing within subject.
What is not possible with a fixed effects model, is control for any subject effects and at the same time isolate the main disease effect. As an example, suppose that the 1 and 9 subjects have 2x the counts for a gene compared to the 4 subject. Is this because disease induces 2x fold change? Or is this something to do with subjects 1 and 9, that is not related to disease? There is no way to tell from this experiment which is the case. The way to say this in statistical jargon is that these two variables are confounded.
Note that, in the limma package, you can inform the model that samples 1 and 2, 3 and 4, etc. are expected to be correlated, using the duplicateCorrelation function.
Okay, perfect. I think I've got it. The Time-point variable is not of immediate interest to our analysis, so we are not that concerned about it.
So, just to sum up this discussion (and to make sure I've got this absolutely clear in my head), I've got two solutions if I have to use the original design proposed:
1) Remove subject_grp from the formula and run analysis with just ~ Disease, making the unrealistic assumption that there is no subject_grp effect (Page 34 of vignette).
2) Explore limma package (instead of DESeq2) to inform model about sample correlations.
Please let me know this is fine and thanks a lot again for all of your help so far. I really, really appreciate it.
Best,
Ali
That's correct
Awesome! Thanks, again!