Hello, I am analysing RNAseq data with limma. As input for KNN classification, should I use log2cpm data or voom transformed data? Thank you so much for your consideration.

limma KNN classification with voom-transformed data

For all intents and purposes, unless you are using a method that can take advantage of the observation weights that come out of `voom()`, "voom transformed data" is essentially just "log2cpm" with a small prior count (0.5).
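To see how close the two are, here is a minimal sketch on simulated counts. The toy data, the object names (`counts`, `group`, `lcpm_small`) and the simulation parameters are all made up for illustration; only `voom()`, `DGEList()`, `calcNormFactors()` and `cpm()` are real limma/edgeR calls. The two matrices should not match exactly, because the two functions apply the offset to the library sizes slightly differently, but they should be very close.

```r
library(limma)
library(edgeR)

set.seed(1)

# Hypothetical toy data: 500 genes x 8 samples, wide range of mean expression
n_genes   <- 500
n_samples <- 8
mu_gene   <- 2^runif(n_genes, 1, 10)
counts <- matrix(rnbinom(n_genes * n_samples, mu = mu_gene, size = 5),
                 nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("s", seq_len(n_samples))))
group <- factor(rep(c("A", "B"), each = 4))

y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)

# voom: E is log2((count + 0.5) / (lib.size + 1) * 1e6), plus weights in v$weights
v <- voom(y, design = model.matrix(~ group))

# Plain log2-CPM with the same small prior count, no weights
lcpm_small <- edgeR::cpm(y, log = TRUE, prior.count = 0.5)

# Largest absolute difference between the two matrices (expected to be small)
max_abs_diff <- max(abs(v$E - lcpm_small))
max_abs_diff
```

The part of the `voom()` output that does not carry over to a generic classifier is `v$weights`, which is why `v$E` on its own buys you nothing over log2cpm.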

The problem is that, with such a small prior count, your log2cpm data will have more variance around the lower expression values, whereas your downstream analysis tools will likely expect the data to be more homoscedastic. This is OK for voom, because the weights are incorporated into the analysis, but they are likely not used in your KNN procedure, or whatever else you want to throw at it.

As you call `edgeR::cpm(y, log = TRUE, prior.count = N)` with larger and larger values of `N`, you will "hammer out" more and more of the variance at the low end of expression. It is often suggested on this support forum to use a value for `prior.count` between 3 and 5 to get your data "approximately" where you want it to be prior to feeding it into clustering, PCA, or whatever other algorithm you choose to run. So you should prefer this approach over using the "voomed" `$E` matrix.
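A rough sketch of what that looks like in practice, again on simulated data. The toy counts, the train/test split and `k = 3` are arbitrary choices for illustration, and `class::knn()` is just one KNN implementation you might use; swap in whatever you actually run.

```r
library(edgeR)
library(class)  # provides knn()

set.seed(1)

# Hypothetical toy data: 500 genes x 12 samples, first 50 genes higher in group B
n_genes   <- 500
n_samples <- 12
group     <- factor(rep(c("A", "B"), each = 6))
mu_mat    <- matrix(2^runif(n_genes, 1, 10), nrow = n_genes, ncol = n_samples)
mu_mat[1:50, group == "B"] <- mu_mat[1:50, group == "B"] * 4
counts <- matrix(rnbinom(length(mu_mat), mu = mu_mat, size = 5),
                 nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("s", seq_len(n_samples))))

y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)

# log2-CPM with a small and a larger prior count
lcpm_p05 <- edgeR::cpm(y, log = TRUE, prior.count = 0.5)
lcpm_p5  <- edgeR::cpm(y, log = TRUE, prior.count = 5)

# Per-gene SD among the lowest-expressed quarter of genes
low    <- rowMeans(counts) <= quantile(rowMeans(counts), 0.25)
sd_p05 <- apply(lcpm_p05[low, ], 1, sd)
sd_p5  <- apply(lcpm_p5[low, ], 1, sd)
c(prior_0.5 = mean(sd_p05), prior_5 = mean(sd_p5))

# KNN on the prior.count = 5 matrix; knn() expects samples in rows
train_idx <- c(1:4, 7:10)
pred <- knn(train = t(lcpm_p5[, train_idx]),
            test  = t(lcpm_p5[, -train_idx]),
            cl    = group[train_idx],
            k     = 3)
table(predicted = pred, truth = group[-train_idx])
```

The SD comparison is the point here: the larger prior count should pull the spread of the low-count genes down, so they contribute less noise to the distances that KNN computes.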

Another approach is to use the output from the `vst` (variance stabilizing transformation) method found in the DESeq2 package to do the same. Perhaps you can think of the `vst` transformation in DESeq2 as similar to `edgeR::cpm(y, log = TRUE, prior.count = N)`, except that the value of `N` isn't constant throughout, which is to say that its value adapts in some smart way within the vst procedure itself.
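For completeness, a minimal sketch of the DESeq2 route, assuming you start from a raw count matrix. The simulated counts and the `coldata` table are hypothetical stand-ins for your own data, and `blind = TRUE` is simply one reasonable choice when the transformed values feed an unsupervised or classification step.

```r
library(DESeq2)

set.seed(1)

# Hypothetical toy data: 2000 genes x 12 samples, wide range of mean expression
n_genes   <- 2000
n_samples <- 12
mu_gene   <- 2^runif(n_genes, 3, 12)
counts <- matrix(rnbinom(n_genes * n_samples, mu = mu_gene, size = 5),
                 nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("s", seq_len(n_samples))))
coldata <- data.frame(group = factor(rep(c("A", "B"), each = 6)),
                      row.names = colnames(counts))

dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData   = coldata,
                              design    = ~ group)

# Variance stabilizing transformation; blind = TRUE ignores the design
vsd <- vst(dds, blind = TRUE)

# Genes x samples matrix on a log2-like scale, ready for KNN, PCA, clustering
mat <- assay(vsd)
dim(mat)
```

As with the log2cpm matrix, transpose `mat` before handing it to anything that expects samples in rows.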

