hi Fabian,

Take a look at the vignette section on the transformations in DESeq2 (here I link to the devel vignette which has an HTML version):

https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#data-transformations-and-visualization

`rlog`

and `varianceStabilizingTransformation`

(or the even faster implementation, `vst`

), are different transformations.

In short, the VST is closed form, derived by integrating 1/sqrt(g(mu)) dmu where g(mu) is the relationship specifying how the variance depends on the mean,which we already know (or can compute) in DESeq2 because we estimate a dispersion ~ mean relationship for differential testing. Here is a wikipedia article on VSTs.

The rlog is not a closed form relationship, but the shrunken log fold changes between samples applying the same procedure that DESeq2 applies to fold changes due to condition. However, the rlog doesn't look at condition when it performs the shrinkage. It just treats each sample individually. The rlog methods are further discussed in the DESeq2 paper.

The two methods both tend to stabilize the variance of log counts across the dynamic range, and we came up with the rlog because we noticed that when the size factors are very different (e.g. 10x difference between smallest and largest library), the rlog tended to outperform the VST in terms of recovering true clusters in simulations. The rlog has some disadvantages though, that it is certainly slower when there are lots of samples, e.g. 100s, because it has to shrink the fold changes per gene, whereas the VST is closed form. Also, the rlog can be sensitive to outliers where the VST is less so because it only looks across all genes at the dispersion trend, but doesn't reduce variance gene by gene.

For clustering and EDA, I think it's reasonable to perform both and look at the diagnostic plots as we have in the vignette to determine which is more useful for your data.

thanks for the great answer and the links!