Hi,
Preface: this is my first post on this site so please excuse me if I don't know the proper posting protocols.
I am using the DESeq function in DESeq2 to look at differential abundance of OTUs between a case and control group (N~600 human subjects). I would also like to use this function to look at differential abundance of agglomerated counts at the phylum through genus levels. I haven't been able to find any posts about this, so my questions are:
1) Is it appropriate to use the DESeq function on raw agglomerated counts?
2) Should size factors be estimated only at the OTU level, and then should these same size factors be applied to the agglomerated datasets? Or should size factors be estimated at each level separately?
3) Another question unrelated to agglomeration: I have a very sparse OTU table (lots of zeros, ~260,000 OTUs), so I have been using this solution from a phyloseq vignette to estimate size factors:
gm_mean = function(x, na.rm=TRUE){ exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x)) }
geoMeans = apply(counts(diagdds), 1, gm_mean)
diagdds = estimateSizeFactors(diagdds, geoMeans = geoMeans)
However when I do this with my data, the large majority of the size factors end up being 1, with a few size factors of 0.99.
When I filter the OTU table to include only OTUs with at least 2 counts in at least 30 people (now ~3,000 OTUs), and then use the above code, the size factors appear to be calculated normally. Any idea why the size factor calculation fails with the larger OTU table?
Thanks in advance for your help!
Here is my session info:
> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: CentOS release 6.3 (Final)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] DESeq2_1.10.0 RcppArmadillo_0.6.200.2.0
[3] Rcpp_0.12.1 SummarizedExperiment_1.0.1
[5] Biobase_2.30.0 GenomicRanges_1.22.1
[7] GenomeInfoDb_1.6.1 IRanges_2.4.1
[9] S4Vectors_0.8.2 BiocGenerics_0.16.1
[11] phyloseq_1.14.0
loaded via a namespace (and not attached):
[1] locfit_1.5-9.1 ape_3.3 lattice_0.20-33
[4] Biostrings_2.38.0 digest_0.6.8 foreach_1.4.3
[7] biom_0.3.12 plyr_1.8.3 chron_2.3-47
[10] futile.options_1.0.0 acepack_1.3-3.3 RSQLite_1.0.0
[13] ggplot2_1.0.1 zlibbioc_1.16.0 data.table_1.9.6
[16] annotate_1.48.0 vegan_2.3-1 rpart_4.1-10
[19] Matrix_1.2-2 proto_0.3-10 splines_3.2.0
[22] BiocParallel_1.4.0 geneplotter_1.48.0 stringr_1.0.0
[25] foreign_0.8-66 igraph_1.0.1 munsell_0.4.2
[28] multtest_2.26.0 mgcv_1.8-9 nnet_7.3-11
[31] gridExtra_2.0.0 Hmisc_3.17-0 codetools_0.2-14
[34] XML_3.98-1.3 permute_0.8-4 MASS_7.3-45
[37] grid_3.2.0 nlme_3.1-122 xtable_1.7-4
[40] gtable_0.1.2 DBI_0.3.1 magrittr_1.5
[43] scales_0.3.0 stringi_1.0-1 XVector_0.10.0
[46] reshape2_1.4.1 genefilter_1.52.0 latticeExtra_0.6-26
[49] futile.logger_1.4.1 Formula_1.2-1 lambda.r_1.1.7
[52] RColorBrewer_1.1-2 iterators_1.0.8 tools_3.2.0
[55] RJSONIO_1.3-0 ade4_1.7-2 survival_2.38-3
[58] AnnotationDbi_1.32.0 colorspace_1.2-6 cluster_2.0.3