Question: Possible methodologies of machine learning in R that can assimilate continuous variables along gene expression data for microarray classification
4.0 years ago by
Greece/Athens/National Hellenic Research Foundation
svlachavas (740) wrote:

Dear Bioconductor community,

I have some gene expression microarray data on which I would like to fit a machine learning method and build a classifier for a binary outcome (disease status). I have found various packages and methodologies in R from the literature, but I would also like to add further continuous variables alongside the genes when training my classifier. As I don't have experience with this specific topic: is this approach generally appropriate for any classification model (e.g. random forests, SVM, etc.), or is it restricted to specific methodologies/packages in R that can handle it? I know of the caret R package, which implements various methodologies, but my main concern is the "validity" of this "multivariate" approach!

Any ideas or suggestions would be greatly appreciated!

Answer: Possible methodologies of machine learning in R that can assimilate continuous v
4.0 years ago by
Steve Lianoglou (12k) wrote:

It's not clear what the perceived issue is here. It's quite common to use a variety of features from "different domains" for each observation/example while trying to build a predictive model ... in your case that means a mix of gene expression and other continuous features you think are important.
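As a rough illustration of mixing features from different domains, here is a minimal sketch in base R. Everything in it is simulated and the names (expr, clinical, status, the column names) are assumptions made up for the example, not anything from the original data. Logistic regression via glm is used only because it ships with base R; other classifiers that accept a numeric feature matrix could be given the same combined matrix.

```r
# Simulated data; all object names, column names and sizes are illustrative assumptions.
set.seed(1)
n <- 60

# Gene expression: rows are samples, columns are genes
expr <- matrix(rnorm(n * 5), nrow = n,
               dimnames = list(NULL, paste0("gene", 1:5)))

# Additional continuous (e.g. clinical) variables for the same samples
clinical <- cbind(age = rnorm(n, mean = 55, sd = 10),
                  bmi = rnorm(n, mean = 26, sd = 4))

# Binary outcome, one label per sample
status <- factor(rep(c("control", "disease"), each = n / 2))

# One feature matrix holding both kinds of features side by side
features <- cbind(expr, clinical)

# Fit a classifier on the combined features (logistic regression as a stand-in)
train_df <- data.frame(features, status = status)
fit <- glm(status ~ ., data = train_df, family = binomial)

# Predicted probability of the second factor level ("disease") for each sample
pred <- predict(fit, type = "response")
```

The only requirement is that the rows of both blocks refer to the same samples in the same order before they are bound together.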

You'll also want to scale the columns of your feature matrix in some way, most commonly by calling scale on the feature matrix (assuming rows are examples and columns are features/measurements), which may address the problem you are asking about. You'll also find that many modeling functions have a scale argument that handles scaling your features for you. It also re-scales the new observations you predict on, using the parameters (mean and standard deviation) observed during training. This matters because it is technically not correct to scale all of your examples together before doing your training/testing splits.
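A minimal sketch of that point in base R, assuming a simulated feature matrix and an arbitrary 70/30 split (the data, the column names and the split size are all made up for illustration): the centering and scaling parameters are estimated on the training rows only and then reused for the held-out rows.

```r
# Simulated feature matrix; names and sizes are illustrative assumptions.
set.seed(42)
x <- matrix(rnorm(100 * 4, mean = 10, sd = 3), nrow = 100,
            dimnames = list(NULL, c("gene1", "gene2", "age", "bmi")))

# Split first, scale afterwards
train_idx <- sample(nrow(x), 70)
x_train <- x[train_idx, ]
x_test  <- x[-train_idx, ]

# scale() centers each column and divides it by its standard deviation;
# the parameters it used are stored as attributes on the result
x_train_scaled <- scale(x_train)
train_center <- attr(x_train_scaled, "scaled:center")
train_sd     <- attr(x_train_scaled, "scaled:scale")

# Apply the training parameters to the test rows (do not re-estimate them)
x_test_scaled <- scale(x_test, center = train_center, scale = train_sd)
```

The scaled test columns will generally not have exactly zero mean and unit variance, which is expected, since they were transformed with the training set's parameters.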



Dear Steve,

Thank you for your answer! Because I'm a newbie in machine learning (although I have read and searched many tutorials and papers), my main aim here is to get some feedback from experienced users in the field: with "simple" methodologies such as random forests, can I use other continuous variables (like clinical data) along with my gene expression microarray data when building the classifier on the training set? Or is the only "valid" solution for this purpose generalized linear models (like glmnet with the elastic net)?

Regarding the second part of your answer, I know that scaling is essential when combining different groups of variables (as in my case) so that they have unit variance, and that it is implemented in various methods (the caret package also offers it, e.g. through the preProcess argument of train). But my main concern is still the first part of your answer!



written 4.0 years ago by svlachavas (740)