Question: Help in interpreting the plot of the function plotRLDF in limma package for microarray dataset concerning two categorical classes
3.0 years ago by
svlachavas660
Greece/Athens/National Hellenic Research Foundation
svlachavas660 wrote:

Dear ALL,

As a continuation of a previous post of mine (C: Questions about the correct implementation of the function plotRLDF from R packa), I present the part of my code used to create a plot with the function plotRLDF(). Briefly, my idea (described more extensively in the above link) is to select, from a pool of identified "hub genes", the top 40 or 50 that best discriminate the cancer from the control samples in my dataset:

dat <- as.data.frame(eset.sel) # my expression set subsetted with the 338 "hub genes" & 60 samples

set.seed(1)

# use the function createDataPartition() from the R package caret
library(caret)

# Disease is the categorical label that indicates disease status (cancer or normal);
# I used the vast majority of the samples for training and left a small percentage for testing
trainIndex <- createDataPartition(dat$Disease, p = .7, list = FALSE)

train_data <- dat[trainIndex,1:338]
train_labels <- dat[trainIndex,340] # keep only this factor label

train_labels
 [1] Normal Cancer Cancer Cancer Normal Cancer Normal Normal Cancer
[10] Normal Cancer Normal Cancer Normal Cancer Cancer Cancer Normal
[19] Cancer Normal Normal Cancer Normal Cancer Normal Normal Cancer
[28] Normal Cancer Normal Cancer Normal Cancer Normal Normal Cancer
[37] Normal Cancer Normal Cancer Normal Cancer
Levels: Cancer Normal

# Similarly

test_data <- genes.set[-trainIndex,1:338]
test_labels <- genes.set[-trainIndex,340]

eset.train <- eset.sel[,rownames(train_data)] # quick way to subset my expressionSet

dim(eset.train)
Features  Samples
     338       42

eset.test <- eset.sel[,rownames(test_data)]

p <- plotRLDF(y=eset.train,
              design=model.matrix(~factor(train_labels,levels=c("Normal","Cancer"))),
              z=eset.test, labels.y=train_labels, labels.z=test_labels,
              col.y="black", col.z="red")

legend("bottomleft", pch=16, col=c("black","red"), legend=c("Training","Predicted"))
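Regarding the level order passed to the design argument above, here is a minimal sketch on simulated values (not my real data; all object names in it are hypothetical). It suggests that, for an ordinary linear model, swapping the reference level only flips the sign of the group coefficient and leaves the fitted values unchanged. Whether plotRLDF behaves the same way is an assumption that I have not verified:

```r
# Simulated example: one gene, 10 Cancer and 10 Normal samples (hypothetical data)
set.seed(1)
labels <- factor(rep(c("Cancer", "Normal"), each = 10))  # default levels: Cancer, Normal
expr   <- rnorm(20, mean = ifelse(labels == "Cancer", 8, 6))

# Design with the default reference level ("Cancer")
design_default <- model.matrix(~ labels)
# Design with "Normal" explicitly set as the reference level
design_releveled <- model.matrix(~ factor(labels, levels = c("Normal", "Cancer")))

fit_default   <- lm.fit(design_default, expr)
fit_releveled <- lm.fit(design_releveled, expr)

# The group coefficient changes sign between the two parameterisations ...
coef_default   <- unname(fit_default$coefficients[2])
coef_releveled <- unname(fit_releveled$coefficients[2])

# ... while the fitted values (and therefore the residuals) are the same
same_fit <- isTRUE(all.equal(fit_default$fitted.values,
                             fit_releveled$fitted.values))
```

Here coef_default is the Normal minus Cancer difference and coef_releveled is the Cancer minus Normal difference, so only the direction of the contrast depends on the level order in this toy case.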

Thus, my questions are the following--also here is the link to the created plot:

1) Does it make any difference that I set "Normal" as the first level in the design argument, given that by default the first level is "Cancer"? Or will it make no actual difference?

2) Regarding the interpretation of the plot created above: how can I briefly describe the two dimensions (axes), that is, the two discriminant functions, in my case? Is it correct to say that the vast majority of the samples of the two classes are separated and grouped in distinct positions? The separation is not perfect, but it holds for most of my samples (considering also the general heterogeneity of tissue specimens from different patients). Furthermore, can the red samples (the testing set) that are grouped with certain training samples be interpreted as having a "similar expression profile" to those specific samples?

3) My main purpose here is not a "perfect classification" but rather a first investigation (given also my relatively small sample size) of whether a subset of these hub genes has discriminatory power that could be studied further (as I also described in my previous post). Which other metrics could I evaluate from the output of the plotRLDF function? Perhaps the "predicting" matrix?

4) Finally, one other very important question: in the calculation of plotRLDF I would also like to include 8 additional continuous variables, to inspect whether any of them are included among the top 50 variables selected. Should I first scale all the features together and then divide my expressionSet as above? Or would the scaling not affect the classification procedure?
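To make question 4 concrete, this is a minimal sketch of one scaling scheme I am considering, written on simulated data with hypothetical object names. It estimates the centering and scaling parameters from the training samples only and then applies them to the test samples, instead of scaling everything together before the split. Note that samples are rows here, unlike in an expressionSet, and that I do not know whether plotRLDF needs such a step at all:

```r
# Simulated matrix: 20 samples (rows) x 3 features on very different scales
# (hypothetical data; samples are rows here, unlike in an expressionSet)
set.seed(1)
x <- cbind(gene1     = rnorm(20, mean = 8,   sd = 1),
           gene2     = rnorm(20, mean = 6,   sd = 2),
           clinical1 = rnorm(20, mean = 100, sd = 25))

train_idx <- 1:14            # hypothetical 70/30 split
x_train <- x[train_idx, ]
x_test  <- x[-train_idx, ]

# Estimate centering and scaling parameters from the training samples only
train_center <- colMeans(x_train)
train_sd     <- apply(x_train, 2, sd)

# Apply the same training-derived parameters to both sets
x_train_scaled <- scale(x_train, center = train_center, scale = train_sd)
x_test_scaled  <- scale(x_test,  center = train_center, scale = train_sd)

# Training columns now have mean 0 and sd 1;
# test columns generally do not, because they reuse the training parameters
train_means <- colMeans(x_train_scaled)
train_sds   <- apply(x_train_scaled, 2, sd)
```

The alternative would be to call scale() once on the full matrix before splitting, which lets the test samples influence the scaling parameters.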

Any help or suggestions would be greatly appreciated!