rlogTransformation vs getVarianceStabilizedData
Entering edit mode
Last seen 14 months ago


I am analysing a bacterial 16S dataset for differential abundance / correlations with phyloseq and DESeq2. I have on factor with 4 levels and ~10 samples in each (for which I'd like to control for) and 1 continuous variable for which I'd like to find OTUs that correlate with it. 

my code is the following: 

   ps <- phyloseq(otu_table(OTU_nifh, taxa_are_rows = F), sample_data(FuncDat))
   d2 <- phyloseq_to_deseq2(ps, ~Factor + Variable)
   Sigdiff <- DESeq(d2, fitType = "parametric", test = "Wald", minReplicatesForReplace = Inf)

First of all: is that the right way to do it?

now my question: I also want to plot the data with an NMDS plot. But I get quite different results depending whether I extract the variance stabilised data with




(I can upload the scatterplots if this is helpful)

Is that expected? 

thanks for your help! Fabian


deseq2 r phyloseq • 983 views
Entering edit mode
Last seen 22 hours ago
United States

hi Fabian,

Take a look at the vignette section on the transformations in DESeq2 (here I link to the devel vignette which has an HTML version):


rlog and varianceStabilizingTransformation (or the even faster implementation, vst), are different transformations.

In short, the VST is closed form, derived by integrating 1/sqrt(g(mu)) dmu where g(mu) is the relationship specifying how the variance depends on the mean,which we already know (or can compute) in DESeq2 because we estimate a dispersion ~ mean relationship for differential testing. Here is a wikipedia article on VSTs.

The rlog is not a closed form relationship, but the shrunken log fold changes between samples applying the same procedure that DESeq2 applies to fold changes due to condition. However, the rlog doesn't look at condition when it performs the shrinkage. It just treats each sample individually. The rlog methods are further discussed in the DESeq2 paper.

The two methods both tend to stabilize the variance of log counts across the dynamic range, and we came up with the rlog because we noticed that when the size factors are very different (e.g. 10x difference between smallest and largest library), the rlog tended to outperform the VST in terms of recovering true clusters in simulations. The rlog has some disadvantages though, that it is certainly slower when there are lots of samples, e.g. 100s, because it has to shrink the fold changes per gene, whereas the VST is closed form. Also, the rlog can be sensitive to outliers where the VST is less so because it only looks across all genes at the dispersion trend, but doesn't reduce variance gene by gene.

For clustering and EDA, I think it's reasonable to perform both and look at the diagnostic plots as we have in the vignette to determine which is more useful for your data.

Entering edit mode

thanks for the great answer and the links!


Login before adding your answer.

Traffic: 538 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6