Dear ALL,
As a continuation of a previous post of mine (C: Questions about the correct implementation of the function plotRLDF from R packa), I present the part of my code used to create a plot with the function plotRLDF(). Briefly, my idea (described more extensively in the above link) is to select, from a pool of identified "hub genes", the top 50 or 40 that best discriminate the cancer samples from the control samples in my dataset:
dat <- as.data.frame(eset.sel) # my expression set subsetted with the 338 "hub genes" & 60 samples
set.seed(1)
library(caret) # provides createDataPartition()
# Disease is the categorical label for disease status (Cancer or Normal);
# 70% of the samples are used for training and the remaining 30% for testing
trainIndex <- createDataPartition(dat$Disease, p = 0.7, list = FALSE)
train_data <- dat[trainIndex,1:338]
train_labels <- dat[trainIndex,340] # keep only this factor label to test
train_labels
[1] Normal Cancer Cancer Cancer Normal Cancer Normal Normal Cancer
[10] Normal Cancer Normal Cancer Normal Cancer Cancer Cancer Normal
[19] Cancer Normal Normal Cancer Normal Cancer Normal Normal Cancer
[28] Normal Cancer Normal Cancer Normal Cancer Normal Normal Cancer
[37] Normal Cancer Normal Cancer Normal Cancer
Levels: Cancer Normal
# Similarly
test_data <- dat[-trainIndex,1:338]
test_labels <- dat[-trainIndex,340]
eset.train <- eset.sel[,rownames(train_data)] # quick way to subset my expressionSet
dim(eset.train)
Features Samples
338 42
eset.test <- eset.sel[,rownames(test_data)]
library(limma) # provides plotRLDF()
p <- plotRLDF(y=eset.train, design=model.matrix(~factor(train_labels, levels=c("Normal","Cancer"))),
              z=eset.test, labels.y=train_labels, labels.z=test_labels, col.y="black", col.z="red")
legend("bottomleft", pch=16, col=c("black","red"), legend=c("Training","Predicted"))
My questions are the following (here is also the link to the created plot):
[https://www.dropbox.com/s/uvd7ozy6yc1oddu/plotRLDF.png?dl=0]
1) Does it make any difference that I set "Normal" as the first level in the design argument, given that by default the first level is "Cancer"? Or will it not make any actual difference?
2) Regarding the interpretation of the plot above: how can I briefly describe the two axes in my case, that is, the two discriminant functions? Can I say that the vast majority of samples from the two classes are separated and grouped in distinct positions? The separation is not perfect, but it holds in my case (considering also the general heterogeneity of my samples, which are tissue specimens from different patients). Furthermore, can the red samples (testing set) that group with particular training samples be interpreted as having a "similar expression profile" to those specific samples?
3) My main purpose here is not a "perfect classification" but rather a first investigation (given also my relatively small sample size) of whether a subset of these hub genes has discriminatory power that could be explored further (as I also described in my previous post). Which other metrics could I evaluate from the output of the plotRLDF function? Perhaps the "predicting" matrix?
4) Finally, one other very important question: I would also like to include 8 additional continuous variables in the plotRLDF calculation, to inspect whether any of them are among the top 50 variables selected. Should I first scale all the features together and then split my ExpressionSet as above? Or would the scaling not affect the classification procedure?
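To make question 1 more concrete, here is a minimal sketch with toy labels (not my real samples; the names toy_labels, toy_relevel, design_default and design_relevel are made up for illustration) of the two codings I am comparing. As far as I understand, both describe the same two-group model and only the reference level changes, but I am not sure whether this matters for plotRLDF:

```r
# Toy labels only (hypothetical, not my real data)
toy_labels <- factor(c("Normal", "Cancer", "Cancer", "Normal", "Cancer", "Normal"))
levels(toy_labels)  # levels are sorted alphabetically by default, so "Cancer" comes first

# Coding A: default level order, reference level = Cancer
design_default <- model.matrix(~ toy_labels)

# Coding B: explicit level order, reference level = Normal (as in my plotRLDF call)
toy_relevel <- factor(toy_labels, levels = c("Normal", "Cancer"))
design_relevel <- model.matrix(~ toy_relevel)

# The second column is named after the non-reference level in each coding
colnames(design_default)
colnames(design_relevel)

# Check whether the group indicator column is simply flipped (0 <-> 1) between the codings
all(design_default[, 2] == 1 - design_relevel[, 2])
```

If the last line returns TRUE, the two design matrices differ only in which group is coded as 1, which is what I would expect from changing the first level.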
Any help or suggestions would be greatly appreciated!