[DESeq2] normalization doesn't remove my mean-variance trend
3
0
Entering edit mode
@aravenscraft-8644
Last seen 9.3 years ago
United States

Hi all,

I conducted a barcoded amplicon sequencing study to characterize the gut flora of butterflies subjected to different treatments.  I have an OTU table and I'm trying to learn to normalize the counts in it using DESeq2. Ultimately, I want to obtain normalized counts for downstream visualization, clustering, and modeling. I have worked through the vignette, but when I compare my plots to the examples in the vignette, they look pathological. I don’t think my data are being normalized correctly, and would greatly appreciate any suggestions for what I should do.

Here's what I've done:

# Import the data
gutdseq <- phyloseq_to_deseq2(exp, ~ gut)

# I have a high prevalence of sparsely sampled OTUs, so I use a zero-tolerant variant of geometric mean:
gm_mean = function(x, na.rm=TRUE){
   exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x))
}
geoMeans <- apply(counts(gutdseq), 1, gm_mean)

gutdseq.fact <- estimateSizeFactors(gutdseq, geoMeans = geoMeans)
gutdseq.loc <- DESeq(gutdseq.fact, fitType="local")

# The resulting mean-dispersion relationship:

plotDispEsts(gutdseq.loc)

DispEsts_Plot

As far as I can tell from comparison with the plot on page 26, I think that looks ok. But the magnitudes of the dispersions of my normalized counts increase markedly with the mean:

resloc <- results(gutdseq.loc)
plotMA(resloc, main="gut: C vs g")

MA_Plot

This looks quite different from the example plots on page 10 of the DESeq2 vignette, and I suspect it's pathological.

Next, I tried extracting the transformed counts using both rlog and varianceStabilizingTransformation. The resulting mean-sd plots show pronounced mean-sd trends, and definitely look pathological compared to the examples on page 19. Furthermore, the rlog plot produces errors.

rlog.gut <- rlog(gutdseq.loc, fast=T)
vst.gut <- varianceStabilizingTransformation(gutdseq.loc)

library("vsn")
par(mfrow=c(1,3))
notAllZero <- (rowSums(counts(gutdseq.loc))>0)
meanSdPlot(log2(counts(gutdseq.loc,normalized=TRUE)[notAllZero,] + 1), main='log2(counts+1)')
meanSdPlot(assay(vst.gut[notAllZero,]), main='varianceStabilizingTransformation')
meanSdPlot(assay(rlog.gut[notAllZero,]), main='rlog')    
    # Warning message:
    # In KernSmooth::bkde2D(x, bandwidth = bandwidth, gridsize = nbin,  :
    # Binning grid too coarse for current (small) bandwidth: consider increasing 'gridsize'

traceback()
4: stop(sprintf(ngettext(length(r), "%s is not TRUE", "%s are not all TRUE"),
       ch), call. = FALSE, domain = NA)
3: stopifnot(is.logical(ranks), length(ranks) == 1, !is.na(ranks))
2: meanSdPlot(assay(vst.gut[notAllZero, ]), "varianceStabilizingTransformation")
1: meanSdPlot(assay(vst.gut[notAllZero, ]), "varianceStabilizingTransformation")

meanSD_Plots

I'm concerned about all of these pathological plots, and particularly concerned about the error I get from rlog, because my size factors vary a LOT, so the rlog transformation is the one I'm supposed to use (according to page 17 of the vignette).

hist(sizeFactors(gutdseq.loc), breaks=50, col='orange')

sizeFactor_histogram

Can anyone explain why my MA plots and meanSD plots look so bad?  I suspect it may have to do with the fact that I have a large number of sparsely sampled OTUs, and/or that there is a lot more variance in OTU counts between my samples than normal, but if that’s the case, what should I do to normalize my data appropriately?

 

Finally, here’s my session info.

sessionInfo()

R version 3.2.2 (2015-08-14)

Platform: x86_64-apple-darwin13.4.0 (64-bit)

Running under: OS X 10.9.5 (Mavericks)

locale:

[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:

[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base    

other attached packages:

 [1] vsn_3.36.0              Biobase_2.28.0          reshape2_1.4.1          plyr_1.8.3              chron_2.3-47            ape_3.3               

 [7] vegan_2.3-0             lattice_0.20-33         permute_0.8-4           ggplot2_1.0.1           scales_0.2.5            DESeq2_1.8.1          

[13] RcppArmadillo_0.5.300.4 phyloseq_1.12.2         Rcpp_0.12.0             GenomicRanges_1.20.5    GenomeInfoDb_1.4.2      IRanges_2.2.7         

[19] S4Vectors_0.6.3         BiocGenerics_0.14.0     BiocInstaller_1.18.4  

loaded via a namespace (and not attached):

 [1] locfit_1.5-9.1        Biostrings_2.36.3     digest_0.6.8          foreach_1.4.2         biom_0.3.12           futile.options_1.0.0

 [7] acepack_1.3-3.3       RSQLite_1.0.0         zlibbioc_1.14.0       data.table_1.9.4      annotate_1.46.1       rpart_4.1-10        

[13] Matrix_1.2-2          preprocessCore_1.30.0 proto_0.3-10          splines_3.2.2         BiocParallel_1.2.20   geneplotter_1.46.0  

[19] stringr_1.0.0         foreign_0.8-65        igraph_1.0.1          munsell_0.4.2         multtest_2.24.0       mgcv_1.8-7          

[25] nnet_7.3-10           gridExtra_2.0.0       Hmisc_3.16-0          codetools_0.2-14      XML_3.98-1.3          MASS_7.3-43         

[31] grid_3.2.2            affy_1.46.1           nlme_3.1-121          xtable_1.7-4          gtable_0.1.2          DBI_0.3.1           

[37] magrittr_1.5          KernSmooth_2.23-15    stringi_0.5-5         XVector_0.8.0         genefilter_1.50.0     affyio_1.36.0       

[43] limma_3.24.15         latticeExtra_0.6-26   futile.logger_1.4.1   Formula_1.2-1         lambda.r_1.1.7        RColorBrewer_1.1-2  

[49] iterators_1.0.7       tools_3.2.2           RJSONIO_1.3-0         ade4_1.7-2            survival_2.38-3       AnnotationDbi_1.30.1

[55] colorspace_1.2-6      cluster_2.0.3 

 

Thank you!! This is my first help request, so I apologize for any errors or confusion.

normalization deseq2 plotMA meanSDplot • 2.6k views
ADD COMMENT
2
Entering edit mode
@ryan-c-thompson-5618
Last seen 5 weeks ago
Icahn School of Medicine at Mount Sinai…

Based on the plots you're showing, it looks like many, perhaps almost all, of your genes have extremely low counts. Notice that your mean expression axis ranges from about 15 at the most to 0.001 at the least. These are rather low expression levels, likely all explained by just one or a few counts per gene. If that's the case, you're going to have a hard time getting any useful statistics out of such sparse data. For example, after the rlog transformation, it looks like all these low-count genes have a standard deviation of zero, which means that the regularization essentially eliminated all the differences in your data for about 80% of your genes because there just wasn't enough data for those genes to calculate meaningful expression differences between samples.

Perhaps your sequencing depth is very low or there is some other data quality problem?

Edit: I just realized you're looking at counts from an OTU table from a metagenomics study. My experience is that OTU counts are way too sparse to do anything useful with them using methods like edgeR or DESeq2 (except possibly for those few OTUs at the top of the expression distribution).

ADD COMMENT
2
Entering edit mode
@mikelove
Last seen 1 day ago
United States

re: "when I compare my plots to the examples in the vignette, they look pathological"

The plotDispEsts plot looks different, because in the vignette we analyze mRNA-seq. But it's ok to look different. The methods work just as  with alternative dispersion~mean relationships (switching to fitType="local" automatically or you can also choose "mean"), although Ryan is correct that with very low counts or very sparse data (only 1 sample per row having a non-zero count) you just won't have any sensitivity for DE.

For the transformations, I would suggest trying fitType="mean" and the VST.

For the size factors, these are multiplicative factors so it's best to look at these on the log scale.

ADD COMMENT
0
Entering edit mode
@aravenscraft-8644
Last seen 9.3 years ago
United States

Thank you Ryan and Michael!  That helps me understand what is happening much better.

I do have low sequencing depths, and my gut communities were both low diversity, and highly variable between individual insects.  The large number of low-abundance OTUs are likely stray bacteria that were just passing through.

Michael: could you explain why you recommend fitType="mean" in this case, instead of fitType="local"?

ADD COMMENT

Login before adding your answer.

Traffic: 344 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6