Question

DESeq2 median of ratios normalization instead of VST or rlog for ordinations?

Linton • 0

Hi all, I have been combing the internet for hours trying to find a clear answer to this question and am a bit stumped. I apologize in advance for the basic question and the long description of my project - just trying to add enough context.

I am attempting to look at differences in abundance across shotgun metagenome samples that came from dust, specifically looking at gene abundances across contigs instead of MAGs. I co-assembled my reads (using normalized reads) into contigs, then mapped the non-normalized reads from each metagenome back to the assembled contigs to get an idea of the abundances of these genes across the metagenomes (a community-function based approach). After mapping the reads back to genes found in the assembly and counting the reads mapped per gene using featureCounts, I divided the total reads by gene length, and would now like to sum these coverages by KO to get a sense of functional abundances in the metagenomes.

A colleague suggested I use the median-of-ratios transformation (MRT) by DESeq2 as a way to transform my coverages for ordinations, regressions, and PERMANOVAs to get a sense of what is influencing/driving functional differences in the sites. However, the DESeq2 vignettes suggest using the VST or rlog transformations for this purpose, rather than the median-of-ratios transformation that is used in the DESeq function for DGE.

My question is: is there a reason not to use MRT as a transformation outside of DGE? Why is VST better than MRT for ordinations and other downstream analyses (outside of DGE)? I am having a bit of trouble understanding the nuance between MRT and the VST method employed by DESeq2. Is MRT less sensitive to gene dispersions compared to VST? I am also having trouble finding the exact equation used by the VST function in DESeq2... maybe I have just been looking at too many links today (see below).

Here are some resources I've used to help me answer this question - definitely open to more. I know there is the DESeq2 paper, which I've read, but I think I am just missing something in there comparing these methods... Sorry for the long post, thanks for your help!

DESeq2 medianofratiostransformation StatisticalMethod Normalization dispersion • 2.0k views

score 2 · Answer 1 · 2024-01-05

First of all, what you call "MRT" (let's stick with that abbreviation) is not a transformation. It's a normalization. What the method does is calculate a per-sample linear scaling factor to correct for differences in sequencing depth and library composition. As clever as the method is, it's just that: a linear scaling factor. Divide each sample's raw counts by its scaling factor (aka size factor) and you get normalized counts. The shape of the data distribution and the composition within a sample do not change, hence it's not a transformation. It's a linear scaling.
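To make the "it's just a scaling factor" point concrete, here is a minimal sketch of a median-of-ratios calculation in plain Python. It is an illustration of the idea, not DESeq2's code: the function name `size_factors` and the toy count matrix are made up for this example, and genes containing a zero in any sample are simply skipped, which can leave very few usable genes in sparse metagenomic tables.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors (illustrative sketch).

    counts: list of genes, each a list of raw counts, one per sample.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for gene in counts:
        # A gene with any zero has no finite log geometric mean, so skip it.
        if any(c <= 0 for c in gene):
            continue
        log_geo_mean = sum(math.log(c) for c in gene) / n_samples
        for j, c in enumerate(gene):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # One factor per sample: the median ratio to the per-gene geometric mean.
    return [math.exp(median(r)) for r in log_ratios]

# Toy data (hypothetical): sample 2 is sequenced twice as deep as sample 1,
# except for the last gene, which differs 6-fold between the samples.
counts = [
    [10, 20],
    [100, 200],
    [50, 100],
    [5, 30],
]
sf = size_factors(counts)

# Normalization is a plain per-sample division, nothing more.
normalized = [[c / s for c, s in zip(gene, sf)] for gene in counts]
```

In this toy example the two size factors differ by a factor of two, so the first three genes end up with identical normalized values in both samples while the last gene keeps its difference. Taking the median is what keeps that one outlying gene from distorting the factors.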

vst and rlog, by contrast, are transformations. They aim to stabilize the variance of the counts across the whole range of average expression. Typically, the variance of the log counts depends on the mean (the log2 baseMean, that is, the average expression). The methods aim to remove that dependency, so removing the technical trend/bias while leaving biological signals intact. Of note, before they do that, they use "MRT" to normalize the data. So basically, vst is "MRT" with some stats magic on top of that to transform data for downstream analysis, removing technical variance bias. The DESeq and DESeq2 papers have more details on that. See also the vignette.
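As a rough illustration of what "variance stabilizing" means, the sketch below uses the delta method, Var(f(X)) ≈ f'(mu)^2 · Var(X), under a toy negative binomial model with variance mu + alpha·mu^2. Everything here is an assumption made for the example: the constant dispersion `ALPHA`, the function names, and the asinh-based transform, which is the textbook stabilizer for that variance function. It is not the equation that DESeq2's vst() uses, since vst() works from a dispersion trend fitted to the data.

```python
import math

ALPHA = 0.1  # assumed constant dispersion (hypothetical value)

def nb_variance(mu, alpha=ALPHA):
    """Negative binomial variance for mean mu and dispersion alpha."""
    return mu + alpha * mu ** 2

def log2p1(x):
    """Plain log2(x + 1) transform."""
    return math.log2(x + 1.0)

def asinh_vst(x, alpha=ALPHA):
    """Textbook variance stabilizer for variance mu + alpha * mu^2."""
    return 2.0 / math.sqrt(alpha) * math.asinh(math.sqrt(alpha * x))

def delta_variance(f, mu, alpha=ALPHA):
    """Delta-method approximation of Var(f(X)) at mean mu."""
    h = 1e-6 * mu
    slope = (f(mu + h) - f(mu - h)) / (2.0 * h)  # numerical derivative
    return slope ** 2 * nb_variance(mu, alpha)

means = [1.0, 10.0, 100.0, 1000.0]
log_vars = [delta_variance(log2p1, m) for m in means]
vst_vars = [delta_variance(asinh_vst, m) for m in means]

for m, lv, vv in zip(means, log_vars, vst_vars):
    print(f"mean={m:7.1f}  var(log2(x+1))~{lv:.3f}  var(stabilized)~{vv:.3f}")
```

Under these assumptions the approximate variance of log2(x + 1) still changes with the mean, while the stabilized values stay close to a constant. That flat mean-variance relationship is what makes distance-based methods such as PCA less dominated by a particular expression range.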

So in a nutshell, normalization (that is, removing sequencing depth and composition bias) is pretty much always necessary before any downstream analysis, be it by MRT or other methods. Transformation depends on your analysis goal. Often vst is suitable, e.g. for PCA or feature/gene selection where you want to select genes that vary between groups (biological variability) rather than genes that vary due to technical biases. In any case, an alternative could be just log2(x + 1) of the "MRT"-normalized counts, as implemented in normTransform(). You should check what is common in the metagenomics field.
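For reference, the log2(x + 1) alternative amounts to very little code. The sketch below is plain Python rather than a call to normTransform(); the helper name `log2_normalized`, the toy counts, and the size factors are all hypothetical, and the size factors are taken as given rather than estimated.

```python
import math

def log2_normalized(counts, size_factors, pseudocount=1.0):
    """log2(count / size_factor + pseudocount) for every gene and sample."""
    return [
        [math.log2(c / s + pseudocount) for c, s in zip(gene, size_factors)]
        for gene in counts
    ]

# Hypothetical raw counts (genes x samples) and precomputed size factors.
counts = [
    [0, 0],
    [7, 14],
    [30, 60],
]
size_factors = [1.0, 2.0]

logged = log2_normalized(counts, size_factors)
```

The pseudocount keeps zeros finite (they map to 0), and genes that differ only by sequencing depth end up with the same value in both samples. Unlike vst, this does nothing about the mean-variance trend.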