Hello, I am analysing RNAseq data with limma. As input for KNN classification, should I use log2cpm data or voom transformed data? Thank you so much for your consideration.
For all intents and purposes, unless you are using a method that can take advantage of the observation weights that come out of voom(), "voom transformed data" is essentially just "log2cpm" with a small prior count (0.5).
The problem is that, with such a small prior count, your log2cpm data will have more variance at the lower expression values, whereas your downstream analysis tools will likely expect the data to be more homoscedastic. This is OK for voom, because the weights are incorporated in the analysis, but they are likely not used in your KNN procedure, or whatever else you want to throw at it.
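To make the "log2cpm with a prior count of 0.5" point concrete, here is a minimal sketch on simulated counts (the toy matrix and the object names are made up for illustration). It assumes voom() is given a plain count matrix with no design, in which case it fits an intercept-only model and takes the column sums as library sizes.

```r
library(limma)

set.seed(1)
n_genes <- 500
# Hypothetical toy data: gene-specific means, 6 samples, negative-binomial counts
mu <- 2^runif(n_genes, 2, 10)
counts <- matrix(rnbinom(n_genes * 6, mu = mu, size = 5), nrow = n_genes,
                 dimnames = list(paste0("gene", 1:n_genes), paste0("s", 1:6)))

# Plain matrix input, no design: intercept-only model, lib.size = colSums(counts)
v <- voom(counts)

# Rebuild E by hand: log2 CPM with 0.5 added to every count
# and 1 added to every library size
lib_size <- colSums(counts)
manual_E <- t(log2(t(counts + 0.5) / (lib_size + 1) * 1e6))

# Compare the two matrices (ignoring attributes such as dimnames)
all.equal(v$E, manual_E, check.attributes = FALSE)

# The precision weights sit in a separate matrix of the same shape;
# a generic KNN routine that is handed v$E never sees them
dim(v$weights)
```

So passing v$E alone to a classifier throws away the part of voom that deals with the extra variance at low counts.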
As you call edgeR::cpm(y, log = TRUE, prior.count = N) with larger and larger values of N, you will "hammer out" more and more of the variance at the low end of expression. It is often suggested on this support forum to use a value for prior.count between 3 and 5 to get your data "approximately" where you want it to be before feeding it into clustering, PCA, or whatever other algorithm you choose to run, so you should prefer this approach over the "voomed" $E matrix.
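A rough sketch of that workflow, again on simulated data: the two-group design, the train/test split, and k = 3 are arbitrary choices for illustration, and class::knn is just one KNN implementation you could use.

```r
library(edgeR)
library(class)

set.seed(1)
n_genes <- 500
# Hypothetical toy data: two groups of 6 samples, first 50 genes 4x higher in B
group <- factor(rep(c("A", "B"), each = 6))
mu_mat <- matrix(2^runif(n_genes, 0, 10), n_genes, 12)
mu_mat[1:50, group == "B"] <- mu_mat[1:50, group == "B"] * 4
counts <- matrix(rnbinom(n_genes * 12, mu = mu_mat, size = 5), n_genes,
                 dimnames = list(paste0("gene", 1:n_genes), paste0("s", 1:12)))

y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)

# Effect of the prior count on lowly expressed genes:
# per-gene SD of log-CPM with a small vs. a larger prior count
low <- rowMeans(counts) < 10
sd_small <- apply(cpm(y, log = TRUE, prior.count = 0.5)[low, ], 1, sd)
sd_large <- apply(cpm(y, log = TRUE, prior.count = 5)[low, ], 1, sd)
c(prior_0.5 = mean(sd_small), prior_5 = mean(sd_large))

# Input for KNN: log-CPM with a moderate prior count
logcpm <- cpm(y, log = TRUE, prior.count = 3)

# knn() expects samples in rows, so transpose the genes-by-samples matrix
x <- t(logcpm)
train_idx <- c(1:4, 7:10)  # arbitrary split: 4 training samples per group
pred <- knn(train = x[train_idx, ], test = x[-train_idx, ],
            cl = group[train_idx], k = 3)
table(predicted = pred, truth = group[-train_idx])
```

In a real analysis you would also want to filter lowly expressed genes first (for example with filterByExpr) and choose k by cross-validation; neither is shown here.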
Another approach is to use the output from the vst (variance stabilizing transformation) method found in the DESeq2 package to do the same. Perhaps you can think of the vst transformation in DESeq2 as similar to edgeR::cpm(y, log = TRUE, prior.count = N), except that the value of N isn't constant throughout, which is to say that its value adapts in some smart way within the vst procedure itself.
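For completeness, a minimal sketch of the DESeq2 route on simulated counts. The toy data and the intercept-only design are assumptions for illustration; with blind = TRUE the transformation ignores the design anyway.

```r
library(DESeq2)

set.seed(1)
n_genes <- 2000
# Hypothetical toy data: 2000 genes x 12 samples
mu <- 2^runif(n_genes, 0, 10)
counts <- matrix(rnbinom(n_genes * 12, mu = mu, size = 5), n_genes,
                 dimnames = list(paste0("gene", 1:n_genes), paste0("s", 1:12)))
coldata <- data.frame(group = factor(rep(c("A", "B"), each = 6)),
                      row.names = colnames(counts))

dds <- DESeqDataSetFromMatrix(countData = counts, colData = coldata,
                              design = ~ 1)

# vst() estimates the dispersion trend on a subset of genes and applies
# the variance stabilizing transformation to the whole matrix
vsd <- vst(dds, blind = TRUE)

# Genes-by-samples matrix on a log2-like scale; transpose with t() before
# passing it to a KNN routine that expects samples in rows
vst_mat <- assay(vsd)
dim(vst_mat)
```

Note that vst() needs a reasonable number of expressed genes to fit its trend; for small gene sets, varianceStabilizingTransformation() in the same package is the slower alternative that uses all genes.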