Question

rlogTransformation vs getVarianceStabilizedData

fabian.roger08 ▴ 10

@fabianroger08-11956

Hej,

I am analysing a bacterial 16S dataset for differential abundance / correlations with phyloseq and DESeq2. I have one factor with 4 levels and ~10 samples in each (which I'd like to control for) and one continuous variable, for which I'd like to find the OTUs that correlate with it.

My code is the following:

   library(phyloseq)
   library(DESeq2)
   ps <- phyloseq(otu_table(OTU_nifh, taxa_are_rows = FALSE), sample_data(FuncDat))
   d2 <- phyloseq_to_deseq2(ps, ~Factor + Variable)
   Sigdiff <- DESeq(d2, fitType = "parametric", test = "Wald", minReplicatesForReplace = Inf)

First of all: is that the right way to do it?

Now my question: I also want to plot the data with an NMDS plot, but I get quite different results depending on whether I extract the variance-stabilised data with

   getVarianceStabilizedData(Sigdiff)

or

   rlogTransformation(t(OTU_nifh))

(I can upload the scatterplots if that would be helpful.)

Is that expected?

Thanks for your help! Fabian

deseq2 r phyloseq • 1.8k views

updated 9.0 years ago by Michael Love 43k • written 9.0 years ago by fabian.roger08 ▴ 10

score 1 · Answer 1 · 2016-12-01

Hi Fabian,

Take a look at the vignette section on the transformations in DESeq2 (here I link to the devel vignette which has an HTML version):

https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#data-transformations-and-visualization

rlog and varianceStabilizingTransformation (or the even faster implementation, vst) are different transformations.

In short, the VST is closed form, derived by integrating 1/sqrt(g(mu)) dmu, where g(mu) is the relationship specifying how the variance depends on the mean, which we already know (or can compute) in DESeq2 because we estimate a dispersion ~ mean relationship for differential testing. See the Wikipedia article on variance-stabilizing transformations.
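To make the integral concrete, here is a worked sketch under a simplifying assumption that is not part of the answer above: a single constant dispersion α, so that the negative binomial variance is g(μ) = μ + αμ². The symbol α and the choice of 0 as the lower limit are illustrative. DESeq2 itself works from the fitted dispersion ~ mean trend rather than one α, so its exact formula differs.

$$
\mathrm{vst}(x) \;=\; \int_0^{x} \frac{d\mu}{\sqrt{\mu + \alpha\mu^{2}}}
\;=\; \frac{2}{\sqrt{\alpha}}\,\operatorname{arcsinh}\!\left(\sqrt{\alpha x}\right)
$$

Differentiating the right-hand side gives back 1/sqrt(x + αx²), which confirms the antiderivative. For large x, arcsinh(z) ≈ log(2z), so the transformation behaves like log(4αx)/sqrt(α), i.e. like a log transform, while for small x it behaves like 2·sqrt(x).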

The rlog has no closed form. Instead, it is built from shrunken log fold changes between samples, obtained with the same procedure that DESeq2 applies to fold changes due to condition. However, the rlog doesn't look at condition when it performs the shrinkage; it just treats each sample individually. The rlog methods are further discussed in the DESeq2 paper.

Both methods tend to stabilize the variance of log counts across the dynamic range. We came up with the rlog because we noticed that when the size factors are very different (e.g. a 10x difference between the smallest and largest library), the rlog tended to outperform the VST in terms of recovering true clusters in simulations. The rlog has some disadvantages, though. It is certainly slower when there are lots of samples (e.g. 100s), because it has to shrink the fold changes per gene, whereas the VST is closed form. The rlog can also be sensitive to outliers. The VST is less so, because the VST only looks across all genes at the dispersion trend and doesn't reduce variance gene by gene.
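As a rough illustration of what "stabilizing the variance across the dynamic range" means, here is a minimal base-R simulation. It is a sketch under assumptions that are not from the thread: a single constant dispersion `alpha`, the made-up means in `mus`, and a hand-written helper `vst_nb` implementing the constant-dispersion form (2/sqrt(alpha)) * asinh(sqrt(alpha * x)). It does not call DESeq2 and is not the package's implementation.

```r
# Base R only; no DESeq2 objects are involved.
set.seed(1)

alpha <- 0.1                      # assumed constant dispersion (illustrative)
mus   <- c(50, 200, 1000, 5000)   # assumed true means spanning the dynamic range
n     <- 5000                     # simulated counts per mean

# Hypothetical helper: VST for variance g(mu) = mu + alpha * mu^2
vst_nb <- function(x, alpha) 2 / sqrt(alpha) * asinh(sqrt(alpha * x))

# For each mean, simulate negative binomial counts and compare the
# standard deviation before and after the transformation.
sd_by_mean <- t(sapply(mus, function(mu) {
  counts <- rnbinom(n, mu = mu, size = 1 / alpha)
  c(mean = mu, sd_raw = sd(counts), sd_vst = sd(vst_nb(counts, alpha)))
}))

print(round(sd_by_mean, 2))
```

In the printed table, `sd_raw` should grow strongly with the mean while `sd_vst` stays roughly flat, which is the property that makes the transformed values better suited to distance-based methods such as NMDS.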

For clustering and EDA, I think it's reasonable to perform both and look at the diagnostic plots as we have in the vignette to determine which is more useful for your data.