Using DESeq2 after agglomerating OTUs


Preface: this is my first post on this site, so please excuse me if I don't follow the proper posting protocols.

I am using the DESeq function in DESeq2 to look at differential abundance of OTUs between a case and a control group (N ≈ 600 human subjects). I would also like to use this function to look at differential abundance of agglomerated counts at each taxonomic level from phylum to genus. I haven't been able to find any posts about this, so my questions are:

1) Is it appropriate to use the DESeq function on raw agglomerated counts?

2) Should size factors be estimated only at the OTU level, and then should these same size factors be applied to the agglomerated datasets? Or should size factors be estimated at each level separately?
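For concreteness, here is a minimal sketch of the agglomeration workflow using phyloseq's `tax_glom()` and `phyloseq_to_deseq2()`, with a toy dataset standing in for the real one (the object names, the `Group` variable, and the toy counts are all hypothetical). It estimates size factors from the agglomerated matrix itself rather than reusing the OTU-level factors:

```r
library(phyloseq)
library(DESeq2)

set.seed(1)
# toy example: 6 OTUs x 4 samples, two genera (hypothetical data)
otu <- matrix(rpois(24, 20), nrow = 6,
              dimnames = list(paste0("OTU", 1:6), paste0("S", 1:4)))
tax <- matrix(rep(c("GenusA", "GenusB"), each = 3), ncol = 1,
              dimnames = list(rownames(otu), "Genus"))
sam <- data.frame(Group = factor(c("case", "case", "control", "control")),
                  row.names = colnames(otu))
ps <- phyloseq(otu_table(otu, taxa_are_rows = TRUE), tax_table(tax),
               sample_data(sam))

# agglomerate OTU counts to the genus level
ps_genus <- tax_glom(ps, taxrank = "Genus")

# build a fresh DESeqDataSet from the agglomerated counts and estimate
# size factors from this matrix, not from the OTU-level matrix
dds_genus <- phyloseq_to_deseq2(ps_genus, ~ Group)
dds_genus <- estimateSizeFactors(dds_genus)
sizeFactors(dds_genus)
```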

3) Another question unrelated to agglomeration: I have a very sparse OTU table (lots of zeros, ~260,000 OTUs), so I have been using this solution from a phyloseq vignette to estimate size factors:

gm_mean = function(x, na.rm = TRUE) {
  # geometric mean of the positive entries only, but divided by the
  # full vector length, so zeros still pull the mean toward 1
  exp(sum(log(x[x > 0]), na.rm = na.rm) / length(x))
}

geoMeans = apply(counts(diagdds), 1, gm_mean)
diagdds = estimateSizeFactors(diagdds, geoMeans = geoMeans)

However, when I do this with my data, the large majority of the size factors end up being exactly 1, with a few at 0.99.

When I filter the OTU table to include only OTUs with at least 2 counts in at least 30 people (leaving ~3,000 OTUs) and then use the above code, the size factors appear to be calculated normally. Any idea why the size factor calculation fails with the larger OTU table?
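The filtering step described above can be written as a row-wise test on the count matrix. A self-contained sketch (the toy matrix, with half its rows dense and half sparse, is hypothetical and only stands in for the real OTU table):

```r
library(DESeq2)

set.seed(1)
# toy stand-in for the OTU table: 25 dense rows and 25 sparse rows,
# 40 samples (hypothetical data)
mat <- rbind(matrix(rpois(25 * 40, 5),   nrow = 25),
             matrix(rpois(25 * 40, 0.2), nrow = 25))
coldata <- data.frame(Group = factor(rep(c("case", "control"), 20)))
dds <- DESeqDataSetFromMatrix(mat, coldata, ~ Group)

# keep OTUs with at least 2 counts in at least 30 samples
keep <- rowSums(counts(dds) >= 2) >= 30
dds_filt <- dds[keep, ]
```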

Thanks in advance for your help!

Here is my session info:

> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: CentOS release 6.3 (Final)

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] DESeq2_1.10.0              RcppArmadillo_0.
 [3] Rcpp_0.12.1                SummarizedExperiment_1.0.1
 [5] Biobase_2.30.0             GenomicRanges_1.22.1
 [7] GenomeInfoDb_1.6.1         IRanges_2.4.1
 [9] S4Vectors_0.8.2            BiocGenerics_0.16.1
[11] phyloseq_1.14.0

loaded via a namespace (and not attached):
 [1] locfit_1.5-9.1       ape_3.3              lattice_0.20-33
 [4] Biostrings_2.38.0    digest_0.6.8         foreach_1.4.3
 [7] biom_0.3.12          plyr_1.8.3           chron_2.3-47
[10] futile.options_1.0.0 acepack_1.3-3.3      RSQLite_1.0.0
[13] ggplot2_1.0.1        zlibbioc_1.16.0      data.table_1.9.6
[16] annotate_1.48.0      vegan_2.3-1          rpart_4.1-10
[19] Matrix_1.2-2         proto_0.3-10         splines_3.2.0
[22] BiocParallel_1.4.0   geneplotter_1.48.0   stringr_1.0.0
[25] foreign_0.8-66       igraph_1.0.1         munsell_0.4.2
[28] multtest_2.26.0      mgcv_1.8-9           nnet_7.3-11
[31] gridExtra_2.0.0      Hmisc_3.17-0         codetools_0.2-14
[34] XML_3.98-1.3         permute_0.8-4        MASS_7.3-45
[37] grid_3.2.0           nlme_3.1-122         xtable_1.7-4
[40] gtable_0.1.2         DBI_0.3.1            magrittr_1.5
[43] scales_0.3.0         stringi_1.0-1        XVector_0.10.0
[46] reshape2_1.4.1       genefilter_1.52.0    latticeExtra_0.6-26
[49] futile.logger_1.4.1  Formula_1.2-1        lambda.r_1.1.7
[52] RColorBrewer_1.1-2   iterators_1.0.8      tools_3.2.0
[55] RJSONIO_1.3-0        ade4_1.7-2           survival_2.38-3
[58] AnnotationDbi_1.32.0 colorspace_1.2-6     cluster_2.0.3


I'm not an expert in metagenomic analysis, so I'll hold off on answering (1).

(2) You should estimate the size factors for the same count matrix you are analyzing.

(3) I'm not sure what the optimal approach to size factor estimation is when the matrix becomes very sparse (e.g. a majority of zeros in every row), as I don't analyze such datasets myself. I would consider using upper quantiles, although you would have to choose which quantile to use. This paper discusses some methods:
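A minimal sketch of the upper-quantile idea: take a chosen quantile of each sample's nonzero counts, then rescale so the size factors have geometric mean 1, mirroring DESeq2's usual convention. The 0.75 quantile and the toy matrix are arbitrary choices for illustration, not a recommendation from the answer above:

```r
set.seed(1)
# toy sparse count matrix: 200 features x 10 samples (hypothetical data)
mat <- matrix(rpois(200 * 10, 0.5), nrow = 200)

# per-sample upper quantile, computed over nonzero counts only
uq <- apply(mat, 2, function(col) quantile(col[col > 0], 0.75))

# rescale so the size factors have geometric mean 1
sf <- uq / exp(mean(log(uq)))
```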

