I am currently trying out geNetClassifier to build a classifier for bulk RNA-seq data. However, I am unsure how exactly I should preprocess my counts before passing them to the geNetClassifier method. In the vignette it says:
Note that since the ranking is built though package EBarrays, the data in the expression set should be normalized intensity values (positive and on raw scale, not on a logarithmic scale).
I am using RNA-seq rather than microarrays, so I do not have intensity values. In their accompanying publication I read:
The preprocessed RNA-Seq expression data matrices containing the reads per kilobase per million mapped reads (RPKM) were downloaded from the TCGA data portal and were log2 transformed (log2(RPKM+1)) prior to be analysed with geNetClassifier.
So now I am unsure what exactly I should do. Right now I am using VST-transformed counts on which I ran limma::removeBatchEffect. If I understand the VST correctly, it results in normalized, log-transformed counts. I put this matrix into an eset, which I then use as the classifier input.
Is this approach correct or should I use a different normalization?
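The preprocessing described above can be sketched roughly as follows. This is only a minimal illustration on simulated data, not the actual analysis: it assumes the VST comes from DESeq2 (the post does not say which implementation was used), and the batch labels, the design matrix, and the object names (dds, mat, mat_bc, eset) are hypothetical placeholders.

```r
library(DESeq2)   # makeExampleDESeqDataSet(), vst()
library(limma)    # removeBatchEffect()
library(Biobase)  # ExpressionSet(), AnnotatedDataFrame(), exprs()

set.seed(1)

# Simulated counts standing in for the real data: 2000 genes x 20 samples,
# with a two-level 'condition' factor (10 samples per level).
dds <- makeExampleDESeqDataSet(n = 2000, m = 20)

# Hypothetical batch labels (assumption), balanced within each condition.
dds$batch <- factor(rep(c("b1", "b2"), times = 10))

# Variance stabilizing transformation; the output is on a log2-like scale.
vsd <- vst(dds, blind = TRUE)
mat <- assay(vsd)

# Remove the batch effect while protecting the condition effect via 'design'.
design <- model.matrix(~ condition, data = as.data.frame(colData(dds)))
mat_bc <- removeBatchEffect(mat, batch = dds$batch, design = design)

# Wrap the corrected matrix and the sample annotation in an ExpressionSet.
pd <- AnnotatedDataFrame(as.data.frame(colData(dds)))
eset <- ExpressionSet(assayData = mat_bc, phenoData = pd)

# Inspect the value range to see which scale the classifier would receive.
range(exprs(eset))
```

The last line is only there to make the scale of the values explicit, since the vignette asks for raw-scale values while the publication describes log2(RPKM+1).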
I would appreciate any comments, or tips on other packages/methods that I could use to classify samples based on whole-genome RNA-seq!