In my current microarray analysis, based on a methodology we have developed in my lab to identify "hub" genes from DE gene lists (which I also obtained with limma, using various comparisons, TREAT, etc.), I ended up with a total set of ~800 DE probesets / hub genes. I would now like to use the function plotRLDF() to identify a small subset of these genes that separates my cancer from my normal samples, and then to test this subset further with various unsupervised clustering methodologies. Although my dataset has a relatively small sample size (60 samples: 30 cancer, 30 normal), I would split it, via the R package caret, into a training set and a small testing set. Furthermore, my design matrix would be something like:
condition <- factor(eSet$Disease, levels=c("Normal","Cancer"))
pairs <- factor(rep(.., each = 2)) # because my samples are paired
design <- model.matrix(~condition+pairs)
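Filling in the sketch above with simulated labels (the alternating Normal/Cancer ordering and the 1:30 pair numbering are assumptions for illustration, not your actual sample layout), the paired design would look like:

```r
# Minimal self-contained sketch of the paired design for 60 samples (30 pairs).
# The disease labels and pair ordering are simulated for illustration only.
disease   <- rep(c("Normal", "Cancer"), times = 30)
condition <- factor(disease, levels = c("Normal", "Cancer"))
pairs     <- factor(rep(1:30, each = 2))   # one level per patient pair
design    <- model.matrix(~ condition + pairs)
dim(design)  # 60 rows; intercept + 1 condition column + 29 pair columns = 31
```

The second column of this matrix (conditionCancer) is the coefficient of interest, with the pair columns absorbing patient-to-patient baseline differences.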
My main questions are:
1) Along with the selected probesets from which the "top discriminant DE genes" will be taken, I would like to incorporate 8 additional continuous features, which are quantitative PET parameters. In order to use them in my ExpressionSet, should I first create a merged data frame, perhaps scale all features, and then coerce it into an ExpressionSet object, given that the method is, naively speaking, a linear classifier? Or would it not make much difference, and I could simply append these variables to my selected probesets?
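To make question 1 concrete, here is a hedged base-R sketch of the merge-and-scale route: the probeset matrix and the 8 PET parameters below are simulated stand-ins (names like "probe1" and "PET1" are made up), and each feature row is standardized before the two blocks are combined so that expression values and PET values are on comparable scales.

```r
# Simulated stand-ins: ~800 selected probesets and 8 PET parameters for 60 samples.
set.seed(1)
exprs_sel <- matrix(rnorm(800 * 60), nrow = 800,
                    dimnames = list(paste0("probe", 1:800), paste0("s", 1:60)))
pet <- matrix(rnorm(8 * 60), nrow = 8,
              dimnames = list(paste0("PET", 1:8), paste0("s", 1:60)))

# Standardize each feature (row) to mean 0, sd 1, then stack the two blocks.
combined <- rbind(t(scale(t(exprs_sel))), t(scale(t(pet))))
dim(combined)  # 808 features x 60 samples
# `combined` could then be wrapped, e.g. Biobase::ExpressionSet(assayData = combined)
```

This is only one way to do it; whether row-wise scaling is appropriate for your log-intensities is exactly the judgment call your question raises.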
2) If I want to further reduce the list of "top" probesets used by the function, should I set something like nprobes=50? And is that number of top-performing probesets, chosen from among the input, then returned by the function?
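As a sketch of what question 2 describes (assuming limma is installed; the data are simulated, and I am assuming plotRLDF() accepts plot=FALSE so it can be called non-interactively):

```r
# Hedged sketch: ask plotRLDF() to use only the 50 most discriminating rows.
library(limma)
set.seed(2)
y <- matrix(rnorm(808 * 60), nrow = 808)
y[1:50, 31:60] <- y[1:50, 31:60] + 2   # let 50 rows separate the two groups
group  <- factor(rep(c("Normal", "Cancer"), each = 30))
design <- model.matrix(~ group)
out <- plotRLDF(y, design = design, nprobes = 50, plot = FALSE)
str(out)  # inspect the returned list for the rows/probes actually used
```

Inspecting the returned list is the surest way to see which probesets entered the discriminant computation.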
3) Apart from setting trend=TRUE, could other options, such as weights computed with arrayWeights(), also be included, or are they irrelevant here? And if they can, they should be computed on the training set, right?
Thank you in advance,