Hi all,
I conducted a barcoded amplicon sequencing study to characterize the gut flora of butterflies subjected to different treatments. I have an OTU table and I'm trying to learn to normalize the counts in it using DESeq2. Ultimately, I want to obtain normalized counts for downstream visualization, clustering, and modeling. I have worked through the vignette, but when I compare my plots to the examples in the vignette, they look pathological. I don’t think my data are being normalized correctly, and would greatly appreciate any suggestions for what I should do.
Here's what I've done:
# Import the data
gutdseq <- phyloseq_to_deseq2(exp, ~ gut)
# I have a high prevalence of sparsely sampled OTUs, so I use a zero-tolerant variant of geometric mean:
gm_mean = function(x, na.rm=TRUE){
exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x))
}
geoMeans <- apply(counts(gutdseq), 1, gm_mean)
gutdseq.fact <- estimateSizeFactors(gutdseq, geoMeans = geoMeans)
gutdseq.loc <- DESeq(gutdseq.fact, fitType="local")
# The resulting mean-dispersion relationship:
plotDispEsts(gutdseq.loc)
As far as I can tell from comparison with the plot on page 26, I think that looks ok. But the magnitudes of the dispersions of my normalized counts increase markedly with the mean:
resloc <- results(gutdseq.loc) plotMA(resloc, main="gut: C vs g")
This looks quite different from the example plots on page 10 of the DESeq2 vignette, and I suspect it's pathological.
Next, I tried extracting the transformed counts using both rlog and varianceStabilizingTransformation. The resulting mean-sd plots show pronounced mean-sd trends, and definitely look pathological compared to the examples on page 19. Furthermore, the rlog plot produces errors.
rlog.gut <- rlog(gutdseq.loc, fast=T) vst.gut <- varianceStabilizingTransformation(gutdseq.loc) library("vsn") par(mfrow=c(1,3)) notAllZero <- (rowSums(counts(gutdseq.loc))>0) meanSdPlot(log2(counts(gutdseq.loc,normalized=TRUE)[notAllZero,] + 1), main='log2(counts+1)') meanSdPlot(assay(vst.gut[notAllZero,]), main='varianceStabilizingTransformation') meanSdPlot(assay(rlog.gut[notAllZero,]), main='rlog') # Warning message: # In KernSmooth::bkde2D(x, bandwidth = bandwidth, gridsize = nbin, : # Binning grid too coarse for current (small) bandwidth: consider increasing 'gridsize' traceback() 4: stop(sprintf(ngettext(length(r), "%s is not TRUE", "%s are not all TRUE"), ch), call. = FALSE, domain = NA) 3: stopifnot(is.logical(ranks), length(ranks) == 1, !is.na(ranks)) 2: meanSdPlot(assay(vst.gut[notAllZero, ]), "varianceStabilizingTransformation") 1: meanSdPlot(assay(vst.gut[notAllZero, ]), "varianceStabilizingTransformation")
I'm concerned about all of these pathological plots, and particularly concerned about the error I get from rlog, because my size factors vary a LOT, so the rlog transformation is the one I'm supposed to use (according to page 17 of the vignette).
hist(sizeFactors(gutdseq.loc), breaks=50, col='orange')
Can anyone explain why my MA plots and meanSD plots look so bad? I suspect it may have to do with the fact that I have a large number of sparsely sampled OTUs, and/or that there is a lot more variance in OTU counts between my samples than normal, but if that’s the case, what should I do to normalize my data appropriately?
Finally, here’s my session info.
sessionInfo() R version 3.2.2 (2015-08-14) Platform: x86_64-apple-darwin13.4.0 (64-bit) Running under: OS X 10.9.5 (Mavericks) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] vsn_3.36.0 Biobase_2.28.0 reshape2_1.4.1 plyr_1.8.3 chron_2.3-47 ape_3.3 [7] vegan_2.3-0 lattice_0.20-33 permute_0.8-4 ggplot2_1.0.1 scales_0.2.5 DESeq2_1.8.1 [13] RcppArmadillo_0.5.300.4 phyloseq_1.12.2 Rcpp_0.12.0 GenomicRanges_1.20.5 GenomeInfoDb_1.4.2 IRanges_2.2.7 [19] S4Vectors_0.6.3 BiocGenerics_0.14.0 BiocInstaller_1.18.4 loaded via a namespace (and not attached): [1] locfit_1.5-9.1 Biostrings_2.36.3 digest_0.6.8 foreach_1.4.2 biom_0.3.12 futile.options_1.0.0 [7] acepack_1.3-3.3 RSQLite_1.0.0 zlibbioc_1.14.0 data.table_1.9.4 annotate_1.46.1 rpart_4.1-10 [13] Matrix_1.2-2 preprocessCore_1.30.0 proto_0.3-10 splines_3.2.2 BiocParallel_1.2.20 geneplotter_1.46.0 [19] stringr_1.0.0 foreign_0.8-65 igraph_1.0.1 munsell_0.4.2 multtest_2.24.0 mgcv_1.8-7 [25] nnet_7.3-10 gridExtra_2.0.0 Hmisc_3.16-0 codetools_0.2-14 XML_3.98-1.3 MASS_7.3-43 [31] grid_3.2.2 affy_1.46.1 nlme_3.1-121 xtable_1.7-4 gtable_0.1.2 DBI_0.3.1 [37] magrittr_1.5 KernSmooth_2.23-15 stringi_0.5-5 XVector_0.8.0 genefilter_1.50.0 affyio_1.36.0 [43] limma_3.24.15 latticeExtra_0.6-26 futile.logger_1.4.1 Formula_1.2-1 lambda.r_1.1.7 RColorBrewer_1.1-2 [49] iterators_1.0.7 tools_3.2.2 RJSONIO_1.3-0 ade4_1.7-2 survival_2.38-3 AnnotationDbi_1.30.1 [55] colorspace_1.2-6 cluster_2.0.3
Thank you!! This is my first help request, so I apologize for any errors or confusion.