Question

classification issues - normalization and standardization



Theresa Brandt ▴ 30

@theresa-brandt-4589

Last seen 9.6 years ago

Hello,

I use microarrays to create and test a classifier and I have a question related to this topic. In theory, one cannot use the test set when creating a classifier. That is obvious for the selection of differentially expressed genes and for training. But what about steps such as normalization, non-specific gene selection (for example, selecting genes with high variance) and standardization? Can I perform these steps on the whole dataset, or should I use only the training set? I have seen that people mostly don't care and use the whole dataset for these steps, but I'm not sure this is really correct.

Best regards,
Theresa Brandt

• 1.4k views

updated 12.8 years ago by Steve Lianoglou ★ 13k • written 12.8 years ago by Theresa Brandt ▴ 30

score 0 · Answer 1 · 2011-07-18

Hi Theresa,

On Mon, Jul 18, 2011 at 7:16 AM, Theresa Brandt <theresabrandt80 at gmail.com> wrote:

> Hello,
> I use microarrays to create and test a classifier and I have a question
> related to this topic. Theoretically one cannot use a test set in creating a
> classifier. It is obvious when thinking about selection of differentially
> expressed genes and about training.

I'm a bit confused here. If, by "cannot use a test set in creating a classifier", you mean that you cannot use your test set during the training step of your model, then that's correct to some degree. People actively swap their data between the training and testing roles when doing things like cross-validation (unless you have a completely separate validation set). But I digress ...

> But what about such steps like
> normalization, non-specific gene selection (for example selection of genes
> with high variance) and standardization? Can I perform these steps on the
> whole dataset? Or should I do it only using the training set? I saw that
> people rather don't care and use the whole dataset to perform these steps but
> I'm not sure if this is really correct.

I wouldn't do much more to all of your data at once other than things like array/RMA normalization. I think it gets a bit questionable when you are "feature mining" across all of your data, although there are scenarios like "transductive learning" that do something like that.

I might just remove low-variance genes by calculating the variance only after you split your data into training/test (that is, on the training portion alone). If you were in, say, a 10-fold cross-validation scenario, then I'd be doing the "variance axe" 10 times.
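To make the "variance axe per fold" idea concrete, here is a minimal sketch in plain Python (standard library only). It is an illustration under assumptions, not a recommended pipeline: the function names (`kfold_indices`, `top_variance_genes`, `fit_scaler`, `apply_scaler`, `cv_splits`), the toy matrix `expr`, and the choice of keeping the top `n_keep` genes are all hypothetical. It also applies the same rule to standardization (per-gene mean and standard deviation learned from the training samples only), which is one reasonable reading of the advice above rather than something stated in it.

```python
import random
import statistics

def kfold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and deal them into k folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def top_variance_genes(data, sample_idx, n_keep):
    """Indices of the n_keep genes with the highest variance,
    computed on the given samples only."""
    variances = [
        statistics.variance([data[s][g] for s in sample_idx])
        for g in range(len(data[0]))
    ]
    ranked = sorted(range(len(variances)),
                    key=lambda g: variances[g], reverse=True)
    return sorted(ranked[:n_keep])

def fit_scaler(data, sample_idx, genes):
    """Per-gene (mean, sd) estimated from the given samples only."""
    params = {}
    for g in genes:
        values = [data[s][g] for s in sample_idx]
        sd = statistics.stdev(values)
        params[g] = (statistics.mean(values), sd if sd > 0 else 1.0)
    return params

def apply_scaler(data, sample_idx, params):
    """Standardize the selected genes using previously fitted parameters."""
    return [[(data[s][g] - m) / sd for g, (m, sd) in params.items()]
            for s in sample_idx]

def cv_splits(data, k=10, n_keep=2, seed=0):
    """Yield one train/test split per fold. Gene selection and scaling
    parameters are re-estimated from the training samples in every fold."""
    folds = kfold_indices(len(data), k, seed)
    for i, test_idx in enumerate(folds):
        train_idx = [s for j, fold in enumerate(folds) if j != i for s in fold]
        genes = top_variance_genes(data, train_idx, n_keep)
        params = fit_scaler(data, train_idx, genes)
        yield {
            "genes": genes,
            "train_idx": train_idx,
            "test_idx": test_idx,
            "train": apply_scaler(data, train_idx, params),
            "test": apply_scaler(data, test_idx, params),
        }

# Toy data (made up): 20 samples (rows) x 5 genes (columns).
rng = random.Random(1)
expr = [[rng.gauss(0, 1 + g) for g in range(5)] for _ in range(20)]
splits = list(cv_splits(expr, k=10, n_keep=2))
```

Each element of `splits` holds the genes chosen in that fold plus the standardized training and test matrices; a classifier would be trained on `"train"` and evaluated on `"test"`. The selected genes may differ from fold to fold, which is expected when the filter is part of what is being validated.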
If you are concerned about how to normalize data you've never seen before, so that you can apply your classifier to it at some later point after training/model building, you might want to look at "frozen RMA":

http://www.bepress.com/jhubiostat/paper189/

which will allow you to normalize new/unseen data in some "standard way".

Perhaps others can provide better insight.

Hope that helps,
-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact