Transforming Variance Stabilised Counts to Relative Abundances
adityabandla (@adityabandla-11584) asked:

For my metagenomics dataset, I would like to retain only genes that are >0.1% in relative abundance, both for plotting and for subsetting differentially abundant genes.

I applied the VST to the raw counts. Is it OK to convert variance stabilised counts to relative abundances, or is it better to do this filtering by first transforming the raw counts matrix to relative abundances?

Edit: since the VST output is on the log2 scale, can I back-transform my VST counts as 2^x to obtain normalised counts, and then convert these into relative abundances?
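
For concreteness, a minimal sketch of the back-transformation I have in mind, assuming an existing DESeqDataSet `dds` with size factors estimated (all object names are placeholders):

    library(DESeq2)

    ## Variance stabilising transformation of the raw counts.
    vsd <- vst(dds, blind = TRUE)

    ## Proposed back-transformation: 2^x on the VST values. This only
    ## approximately recovers the normalised counts, since the VST is close
    ## to log2 for large counts but deliberately deviates for small ones.
    back_transformed <- 2^assay(vsd)

    ## Compare with the directly normalised counts for the first sample:
    head(cbind(vst_back   = back_transformed[, 1],
               normalised = counts(dds, normalized = TRUE)[, 1]))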

Tag: deseq2
@wolfgang-huber-3550 (EMBL European Molecular Biology Laboratory) answered:

At the core, this is a question about the variance-bias trade-off between two different estimators. The variance-bias trade-off is, of course, a huge topic that pervades much of statistics (see e.g. http://stats.stackexchange.com/questions/20295/what-problem-do-shrinkage-methods-solve). With your finite data sample, you can only imperfectly estimate the true, underlying abundance, and the question is which trade-offs you accept. The two main opposing goals are precision and unbiasedness.

The "naive" counts (after suitable library size normalization) are unbiased estimators of the true abundance, but for small numbers they can be highly variable. Also, ratios between them have nasty finite-sample behavior. In contrast, the VST aims to trade a more or less small amount of bias for a big reduction in variability and more normal behavior. (This applies to the small counts; for the large ones, the VST and log2 are essentially equivalent.)

So there is really no apodictic answer to your question; it depends on what you want to do. That said, I'd choose the normalized but otherwise untransformed values for a task such as the one you describe, just because it's simpler (Occam's razor).
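
For illustration, a minimal sketch of that simpler route, assuming a DESeqDataSet `dds` with size factors estimated; the 0.001 cut-off mirrors the 0.1% threshold from the question:

    ## Normalized, otherwise untransformed counts -> relative abundances.
    norm_counts <- counts(dds, normalized = TRUE)
    rel_ab      <- sweep(norm_counts, 2, colSums(norm_counts), "/")

    ## Keep genes whose mean relative abundance exceeds 0.1%.
    keep <- rowMeans(rel_ab) > 0.001
    dds_filtered <- dds[keep, ]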

To add to the points here about bias and variance: counts scale with feature length (here, feature = gene), so the typical abundance measures divide out both library size and feature length. If you really need an estimate of abundance that is comparable across genes, you would want to divide out the feature length.
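
As a sketch, DESeq2's fpkm() gives length-corrected values when gene lengths are available; `gene_lengths` below is a hypothetical per-gene vector of lengths in base pairs:

    ## fpkm() takes feature lengths from rowRanges(dds), or from a
    ## `basepairs` column in mcols(dds) when ranges are not set.
    mcols(dds)$basepairs <- gene_lengths  # hypothetical lengths in bp
    length_corrected <- fpkm(dds, robust = TRUE)

    ## Relative abundances that are comparable across genes:
    rel_ab_len <- sweep(length_corrected, 2, colSums(length_corrected), "/")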

If I were to re-phrase things, I would ask: do you really want to filter out low-abundance features, or features with a low signal-to-noise ratio? Our transformations help with the latter.

Another note, which I'm not sure is widely known: if you use a fast transcript quantifier like Salmon upstream of DESeq2, then the transformations in DESeq2 correct both for library size and for potential changes in feature length across samples.
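
A minimal sketch of that route, assuming per-sample Salmon quant.sf files listed in `files`, a `tx2gene` transcript-to-gene table, and a `samples` data.frame with a `condition` column (all placeholder names):

    library(tximport)
    library(DESeq2)

    ## Import Salmon quantifications, summarised to the gene level.
    txi <- tximport(files, type = "salmon", tx2gene = tx2gene)
    dds <- DESeqDataSetFromTximport(txi, colData = samples,
                                    design = ~ condition)

    ## The transformations then use average-transcript-length offsets,
    ## correcting for library size and for length changes across samples.
    vsd <- vst(dds)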
