Question: Possible methodologies of machine learning in R that can assimilate continuous variables along gene expression data for microarray classification
3.7 years ago by
svlachavas660
Greece/Athens/National Hellenic Research Foundation
svlachavas660 wrote:

Dear Bioconductor community,

I have some gene expression microarray data on which I would like to fit a machine learning method and build a classifier for a binary outcome (disease status). I have found various R packages and methodologies in the literature, but I would also like to add continuous variables alongside the genes when training the classifier. As I have no experience with this specific topic: is this approach generally appropriate for any classification model (e.g. random forests, SVMs), or is it restricted to specific methodologies/packages in R that can handle it? I know the caret R package, which implements various methodologies, but my main concern is the "validity" of this "multivariate" approach.

I would be grateful for any ideas or suggestions!

modified 3.7 years ago by Steve Lianoglou12k • written 3.7 years ago by svlachavas660
Answer: Possible methodologies of machine learning in R that can assimilate continuous v
3.7 years ago by
Denali
Steve Lianoglou12k wrote:

It's not clear what the perceived issue is here. It's quite common to use a variety of features from "different domains" for each observation/example while trying to build a predictive model ... in your case that means a mix of gene expression and other continuous features you think are important.
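As a minimal sketch of what such a mixed feature matrix could look like, the snippet below binds simulated expression values and simulated clinical covariates column-wise. All object names (`expr`, `clinical`, `features`, `status`) and the simulated values are assumptions for illustration, not data from this thread.

```r
set.seed(1)
n_samples <- 20
n_genes <- 100

# Hypothetical expression matrix: rows are samples, columns are genes
expr <- matrix(rnorm(n_samples * n_genes, mean = 8, sd = 2),
               nrow = n_samples,
               dimnames = list(paste0("sample", seq_len(n_samples)),
                               paste0("gene", seq_len(n_genes))))

# Hypothetical clinical covariates for the same samples, in the same row order
clinical <- data.frame(age = runif(n_samples, 30, 80),
                       bmi = rnorm(n_samples, 26, 4),
                       row.names = rownames(expr))

# Check that both tables describe the same samples in the same order
stopifnot(identical(rownames(expr), rownames(clinical)))

# One feature matrix: genes and clinical variables side by side
features <- cbind(expr, as.matrix(clinical))

# Hypothetical binary outcome (disease status), one label per sample
status <- factor(sample(c("control", "disease"), n_samples, replace = TRUE))

dim(features)
```

A matrix like `features`, together with `status`, has the usual "rows are examples, columns are features" shape that most classification functions in R accept, whichever domain each column comes from.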

You'll also want to scale the columns of your feature matrix in some way, most commonly by calling scale on it (assuming rows are examples and columns are features/measurements), which will address the problem you might be asking about. Many modeling functions also have a scale argument that handles this for you: it scales the training features and then re-scales the new observations you predict on using the parameters (mean and standard deviation) observed during training. That matters because it is technically not correct to scale all of your examples together before doing your training/testing splits.
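The train-then-apply scaling described here can be sketched with base R's `scale`, which stores the column means and standard deviations it used as attributes. The data, the 20/10 split, and the object names below are assumptions made up for illustration.

```r
set.seed(42)

# Hypothetical feature matrix: 30 samples, 4 gene columns plus one clinical column
x <- cbind(matrix(rnorm(30 * 4, mean = 8, sd = 2), nrow = 30),
           runif(30, 30, 80))
colnames(x) <- c(paste0("gene", 1:4), "age")

# Split into training and test samples BEFORE any scaling
train_idx <- sample(nrow(x), 20)
x_train <- x[train_idx, ]
x_test <- x[-train_idx, ]

# Center and scale the training columns; scale() records the parameters it used
x_train_scaled <- scale(x_train)
train_center <- attr(x_train_scaled, "scaled:center")
train_scale <- attr(x_train_scaled, "scaled:scale")

# Apply the TRAINING means and standard deviations to the test samples
x_test_scaled <- scale(x_test, center = train_center, scale = train_scale)
```

The test columns will generally not have exactly zero mean and unit variance after this step, which is expected: they are expressed in the units learned from the training set.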

Dear Steve,

thank you for your answer! Because I'm a newbie in machine learning (although I have read and searched many tutorials and papers, like http://www.r-bloggers.com/in-depth-introduction-to-machine-learning-in-15-hours-of-expert-videos/), the main issue I am raising here is to get some feedback from experienced users in the field: for "simple" methodologies such as random forests, can I use other continuous variables (like clinical data) along with my gene expression microarray data to build the classifier on the training set? Or is the only "valid" solution for this purpose generalized linear models (like glmnet with the elastic net)?

Regarding the second part of your answer, I know that scaling is essential when combining different groups of variables (as in my case) so that they all have unit variance, and that it is implemented in various methods (the caret package also supports it, e.g. via the preProcess argument of train). But still, my main concern is the first part of your answer.

Best,

Efstathios