How to use expression set object to select genes using different gene selection methods
@babumanish837-8404
India

I want to select the top k genes from GDS data and then apply a classification algorithm to find out how well different gene selection algorithms (t-test, chi-square test, mRMR, etc.) perform relative to each other. I have used the following R code to generate an expression set from the GDS data.

library(GEOquery)

gds4515=getGEO(filename="GDS4515.soft.gz")

eset=GDS2eSet(gds4515,do.log2=TRUE)

 

I don't know what to do next. Do I first have to normalize the data, or do something else? If I have to normalize it, how can I do that? And what should I do after that?

microarray biobase geoquery • 3.4k views

GDS records have been normalized by the submitter.  If you agree that the normalization is appropriate, you could proceed with your analysis.  You say "select top k genes" and then "apply some classification algorithm" and then "gene selection algorithms".  I am not at all clear on what you are actually trying to do.


Dear Sean Davis,

I am working on a project in which I have to compare the performance of different gene selection (feature selection) algorithms, e.g. t-test, chi-square test, mRMR. I am working with two-class colon cancer microarray data. First I will divide the data into two parts, (1) a training set and (2) a test set, and apply the above algorithms to the training set. Since a microarray dataset contains very few samples and a large number of genes (features), I want to reduce the number of genes with the different gene selection algorithms and compare their performance against each other. To compare performance, I will use a classification algorithm, e.g. SVM, to classify the test set.
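The workflow described above can be sketched in base R as follows. This is only a minimal illustration under stated assumptions: the matrix `x` is simulated and stands in for `exprs(eset)`, the class labels `cls` are hypothetical, only the t-test ranking is shown, and a nearest-centroid classifier stands in for the SVM to keep the sketch free of extra packages (the `svm()` function from the e1071 package could be substituted).

```r
# Simulated stand-in for exprs(eset): 200 genes x 40 samples, two classes.
set.seed(1)
n_genes <- 200
n_samples <- 40
x <- matrix(rnorm(n_genes * n_samples), nrow = n_genes,
            dimnames = list(paste0("gene", seq_len(n_genes)),
                            paste0("s", seq_len(n_samples))))
cls <- factor(rep(c("normal", "tumor"), each = n_samples / 2))  # hypothetical labels

# Give the first 10 genes a true class difference (assumed signal).
x[1:10, cls == "tumor"] <- x[1:10, cls == "tumor"] + 2

# Stratified split: 70% of each class goes to the training set.
train_idx <- unlist(lapply(split(seq_len(n_samples), cls),
                           function(i) sample(i, round(0.7 * length(i)))))
test_idx <- setdiff(seq_len(n_samples), train_idx)

# Rank genes by Welch t-test p-value using the training samples only,
# so that the test set plays no part in gene selection.
pvals <- apply(x[, train_idx], 1,
               function(g) t.test(g ~ cls[train_idx])$p.value)
k <- 10
top_k <- names(sort(pvals))[seq_len(k)]

# Nearest-centroid classifier (stand-in for SVM) on the selected genes.
centroids <- sapply(levels(cls), function(l)
  rowMeans(x[top_k, train_idx[cls[train_idx] == l], drop = FALSE]))
pred <- apply(x[top_k, test_idx, drop = FALSE], 2, function(s)
  levels(cls)[which.min(colSums((centroids - s)^2))])

# Fraction of test samples classified correctly.
accuracy <- mean(pred == as.character(cls[test_idx]))
```

Repeating the ranking step with each selection method (and ideally over several random splits) would give one accuracy per method to compare.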

svlachavas ▴ 840
@svlachavas-7225
Germany/Heidelberg/German Cancer Resear…

Dear Babumanish837,

You could first check the comprehensive vignette (http://bioconductor.org/packages/release/bioc/vignettes/GEOquery/inst/doc/GEOquery.html), which describes in detail how to use the GEOquery package. Generally, you would first normalize your expression set and then apply some kind of non-specific filtering (e.g. non-specific intensity filtering or another combined filter) to obtain a subset for your classification procedure. But in this specific case, since you have applied the log2 transformation and already have your expression set, you could move forward as follows:

1) inspect via a boxplot how the data look: boxplot(as.data.frame(exprs(eset)))

2) use other plots (histograms, PCA plots, QQ plots, MDS plots) to explore your data further
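Steps 1 and 2 can be sketched in base R as below. This is a rough illustration, not a prescribed procedure: the matrix `e` is simulated and stands in for `exprs(eset)`, and the group labels `grp` are hypothetical. With real GDS data, probes containing missing values would need to be removed first, since `prcomp()` and `dist()`-based plots do not handle them well.

```r
# Simulated stand-in for exprs(eset): 100 probes x 12 samples, log2 scale.
set.seed(1)
e <- matrix(rnorm(100 * 12, mean = 8), nrow = 100,
            dimnames = list(paste0("probe", 1:100), paste0("s", 1:12)))
grp <- factor(rep(c("A", "B"), each = 6))  # hypothetical sample groups

# Step 1: per-sample distributions; similar boxes suggest comparable scaling.
boxplot(as.data.frame(e), las = 2, main = "Per-sample log2 intensity")

# Step 2a: PCA on samples. prcomp() expects samples in rows, so transpose.
pca <- prcomp(t(e), center = TRUE, scale. = FALSE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
plot(pca$x[, 1:2], col = as.integer(grp), pch = 19,
     xlab = "PC1", ylab = "PC2", main = "PCA of samples")

# Step 2b: classical MDS on Euclidean distances between samples.
mds <- cmdscale(dist(t(e)), k = 2)
plot(mds, col = as.integer(grp), pch = 19,
     xlab = "Coordinate 1", ylab = "Coordinate 2", main = "MDS of samples")
```

Samples that sit far from the rest of their group in the PCA or MDS plot are worth inspecting before any gene selection is attempted.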

3) the choice of filter is somewhat arbitrary and depends on the experimental study. For instance, you could perform a statistical test (e.g. with limma) and then select a subset of the differentially expressed genes as possible candidates for classification. Or use another combined filtering procedure, like the one described in the multtest R package, implemented here with genefilter:

e <- exprs(eset)
library(genefilter)
# Double filter: at least 40% of the samples have an intensity value bigger than 100,
# and the coefficient of variation (sd/mean) is between 0.7 and 10
my_fun <- filterfun(pOverA(p = 0.4, A = 100), cv(a = 0.7, b = 10))
# Un-log2 the intensity values and apply the above filter
my_filter <- genefilter(2^e, my_fun)
# Keep the "reliable" probesets
eset_filter <- eset[my_filter, ]
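To make the two criteria above concrete, here is a base-R approximation of what they compute, applied to a small invented matrix. The probe names and intensity values are made up for illustration, and the comparisons reflect the description given above (more than 100 in at least 40% of samples; sd/mean between 0.7 and 10) rather than the exact internals of genefilter.

```r
# Toy intensities on the original (un-logged) scale: 4 probes x 5 samples.
x <- rbind(
  low      = c(20, 30, 25, 40, 35),       # never above 100
  flat     = c(500, 510, 495, 505, 490),  # expressed but nearly constant
  variable = c(50, 150, 900, 1200, 80),   # expressed and variable
  sparse   = c(10, 10, 10, 10, 2000)      # above 100 in one sample only
)

# Criterion 1 (cf. pOverA(p = 0.4, A = 100)):
# at least 40% of the samples have an intensity above 100.
pass_expr <- rowMeans(x > 100) >= 0.4

# Criterion 2 (cf. cv(a = 0.7, b = 10)):
# coefficient of variation sd/mean lies between 0.7 and 10.
cv_val <- apply(x, 1, sd) / abs(rowMeans(x))
pass_cv <- cv_val >= 0.7 & cv_val <= 10

# A probe is kept only if it passes both criteria.
keep <- pass_expr & pass_cv
x_filtered <- x[keep, , drop = FALSE]
```

Only the `variable` probe survives: `low` and `sparse` fail the intensity criterion, and `flat` fails the variation criterion.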

Note also that the limma User's Guide describes excellent preprocessing steps and various filtering approaches for many kinds of studies, but the final choice is up to you.

Best,

Efstathios
