classification issues - normalization and standardization
@theresa-brandt-4589
Hello,

I use microarrays to create and test a classifier, and I have a question related to this topic. In theory, one cannot use the test set when creating a classifier. This is obvious for the selection of differentially expressed genes and for training. But what about steps such as normalization, non-specific gene selection (for example, selecting genes with high variance), and standardization? Can I perform these steps on the whole dataset, or should I use only the training set? I have seen that people often don't care and use the whole dataset for these steps, but I'm not sure this is really correct.

Best regards,
Theresa Brandt
@steve-lianoglou-2771
Hi Theresa,

On Mon, Jul 18, 2011 at 7:16 AM, Theresa Brandt <theresabrandt80 at gmail.com> wrote:
> Hello,
> I use microarrays to create and test a classifier and I have a question
> related to this topic. In theory, one cannot use a test set in creating a
> classifier. It is obvious when thinking about selection of differentially
> expressed genes and about training.

I'm a bit confused here. If, by "cannot use a test set in creating a classifier", you mean that you cannot use your test set during the training step of your model, then that's correct to some degree. People actively swap their data between the training and testing roles when doing things like cross-validation (unless you have a completely separate validation set). But I digress ...

> But what about such steps like
> normalization, non-specific gene selection (for example selection of genes
> with high variance) and standardization? Can I perform these steps on the
> whole dataset? Or should I do it only using the training set? I saw that
> people rather don't care and use the whole dataset to perform these steps but
> I'm not sure if this is really correct.

I wouldn't do much more to all of your data at once other than things like array/RMA normalization. I think it might get a bit questionable when you are "feature mining" across all of your data, although there are scenarios like "transductive learning" that do something like that. I might try to remove low-variance genes by calculating the variance only after you split your data into training and test sets. If you were in, say, a 10-fold cross-validation scenario, then I'd be doing the "variance axe" 10 times.
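To make the "variance axe" idea concrete, here is a minimal sketch in plain Python (standard library only). It is not code from this thread: the function names, the toy expression matrix, and the choice to rank genes by variance on the training portion of each fold are all assumptions made for illustration. The only point it tries to show is that the gene filter is recomputed inside every fold, so the held-out samples never influence which genes are kept.

```python
import random
from statistics import pvariance

def kfold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and deal them into k roughly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def top_variance_genes(expr, sample_idx, n_keep):
    """Return indices of the n_keep genes with the highest variance,
    computed over the given samples only.

    expr is a list of samples; each sample is a list of per-gene values.
    """
    n_genes = len(expr[0])
    variances = [pvariance([expr[s][g] for s in sample_idx])
                 for g in range(n_genes)]
    ranked = sorted(range(n_genes), key=lambda g: variances[g], reverse=True)
    return sorted(ranked[:n_keep])

def per_fold_gene_filters(expr, k, n_keep, seed=0):
    """For each fold, select genes using the training samples of that fold."""
    folds = kfold_indices(len(expr), k, seed)
    results = []
    for i, test_idx in enumerate(folds):
        train_idx = [s for j, fold in enumerate(folds) if j != i for s in fold]
        kept = top_variance_genes(expr, train_idx, n_keep)  # test samples unused
        results.append((train_idx, test_idx, kept))
    return results

# Hypothetical toy data: 6 samples x 3 genes (gene 0 is constant,
# gene 1 varies slightly, gene 2 varies strongly).
expr = [[5.0, 1.0, 0.0], [5.0, 2.0, 10.0], [5.0, 1.0, 20.0],
        [5.0, 2.0, 30.0], [5.0, 1.0, 40.0], [5.0, 2.0, 50.0]]
splits = per_fold_gene_filters(expr, k=3, n_keep=1)
for train_idx, test_idx, kept in splits:
    print("train:", sorted(train_idx), "test:", sorted(test_idx), "kept genes:", kept)
```

In a real analysis the classifier would then be trained on the kept genes of the training samples and evaluated on the same genes of the held-out samples, once per fold.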
If you are concerned about how to normalize data you've never seen before, so that you can apply your classifier to it at some later point after training/model building, you might want to look at "frozen RMA":

http://www.bepress.com/jhubiostat/paper189/

which will allow you to normalize new/unseen data in some 'standard way'.

Perhaps others can provide better insight.

Hope that helps,
-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
Hi Steve,

Thank you very much for your help. Now it is clear to me. I can do the array normalization (like RMA) on the whole dataset. Then I have to split the dataset, and I can do things like gene filtering or gene standardization using only the training set.

I was confused after reading the book "Bioconductor Case Studies". In the chapter about supervised machine learning, they performed non-specific gene filtering and gene standardization on the whole dataset. But I would rather trust that you are right.

Theresa Brandt
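As an illustration of the "standardize using only the training set" step summarized above, here is a small hedged sketch in plain Python. The helper names `fit_standardizer` and `apply_standardizer` and the toy values are hypothetical, and mapping constant genes to 0 is just one possible convention. The per-gene mean and standard deviation are estimated from the training samples and then reused unchanged on the test samples.

```python
from statistics import mean, pstdev

def fit_standardizer(train):
    """Estimate per-gene mean and standard deviation from training samples only."""
    n_genes = len(train[0])
    means = [mean(sample[g] for sample in train) for g in range(n_genes)]
    sds = [pstdev([sample[g] for sample in train]) for g in range(n_genes)]
    return means, sds

def apply_standardizer(samples, means, sds):
    """Standardize samples with previously fitted parameters.

    Genes with zero training standard deviation are mapped to 0.0
    (an assumed convention to avoid division by zero).
    """
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(sample, means, sds)]
        for sample in samples
    ]

# Hypothetical toy data: 3 training samples and 1 test sample, 2 genes each.
train = [[2.0, 10.0], [4.0, 10.0], [6.0, 10.0]]
test = [[8.0, 12.0]]

means, sds = fit_standardizer(train)              # parameters come from train only
train_z = apply_standardizer(train, means, sds)
test_z = apply_standardizer(test, means, sds)     # test reuses the train parameters
print(means, sds)
print(test_z)
```

With this arrangement the test samples are transformed exactly as future, unseen arrays would be, which is what the held-out evaluation is meant to mimic.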
Hi Theresa,

On Tue, Jul 19, 2011 at 3:47 AM, Theresa Brandt <theresabrandt80 at gmail.com> wrote:
> Hi Steve,
> Thank you very much for your help. Now it is clear to me. I can do the
> array normalization (like RMA) on the whole dataset. Then I have to split
> the dataset and I can do things like filtering of genes or gene
> standardization only on a training set.
> I was confused after reading a book "Bioconductor Case Studies". In the
> chapter about supervised machine learning they performed non-specific gene
> filtering and gene standardization on the whole dataset. But I would rather
> trust that you are right.

I wouldn't trust that I am right ... the people who wrote that book have some serious credentials. :-)

There are arguably "lots" of things you can do to all of your data, especially if you do not use the labels on your data as part of your data preprocessing. I was just suggesting what I might do in your situation, is all.

I have never read the book you mentioned, but judging by the folks who wrote it, I would imagine that what they are doing in that particular scenario is also valid.

-steve