Hi James,
This is a general problem of normalization methods that work by adapting
arrays in a set to themselves, and not to an independent reference.
Option 1 is indeed discredited when you want to get a fair estimate of
classification rates, since it does not faithfully simulate the real
application where you want to classify a new sample.
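The leakage in option 1 can be illustrated with the quantile
normalization step alone. The following is a minimal sketch in Python
with NumPy, not actual RMA: the data are simulated, and the function
name quantile_normalize is made up for this illustration. It shows that
the normalized values of the training arrays change once test arrays
are added to the set.

```python
import numpy as np

def quantile_normalize(x):
    """Quantile-normalize the columns (arrays) of x (probes x arrays)."""
    # rank of each probe within its own array (0 = smallest value)
    ranks = np.argsort(np.argsort(x, axis=0), axis=0)
    # mean over arrays at each rank: the common target distribution
    mean_quantiles = np.sort(x, axis=0).mean(axis=1)
    return mean_quantiles[ranks]

# simulated intensities (assumption: made-up data, 1000 probes, 5 + 5 arrays)
rng = np.random.default_rng(0)
train = rng.lognormal(mean=7.0, sigma=1.0, size=(1000, 5))
test = rng.lognormal(mean=7.5, sigma=1.2, size=(1000, 5))

# training arrays normalized alone vs. together with the test arrays
train_alone = quantile_normalize(train)
train_joint = quantile_normalize(np.hstack([train, test]))[:, :5]

# any nonzero shift means the test arrays influenced the training values
max_shift = np.abs(train_alone - train_joint).max()
print("largest change in a training value:", max_shift)
```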
Option 2 does not work since f contains, for each array, a number of
array-specific, idiosyncratic parameters that reflect hybridization
conditions, labeling efficiency, RNA extraction, etc. You cannot "learn"
them in advance.
The option I'd take is to look for a normalization method that
normalizes each new array individually (or in sets appropriate to your
intended application) to an existing database of reference arrays. I
know that various people on this list have been/are working on such
methods. But I am probably not up-to-date myself - maybe someone can
recommend?
Best wishes
Wolfgang
------------------------------------------------------------------
Wolfgang Huber EBI/EMBL Cambridge UK http://www.ebi.ac.uk/huber
> Hi, I have a question about RMA normalization. Since RMA is an
> across-sample normalization, suppose I have 50 training samples (cel
> files) and 50 test samples (cel files). There are two ways to perform
> normalization:
> 1. Combine all 100 samples and normalize them together with RMA. Then
> train on the 50 training samples and classify the 50 test samples.
> 2. Run RMA on the 50 training samples only, so that each cel file is
> converted to a gene expression vector. Suppose the mapping from cel
> file to expression vector is Expression = f(cel), where the form of f
> is determined by the 50 training cel files. Then apply the same
> mapping to the test cel files.
>
> I would think method 2 is more reasonable and truly blind. However,
> it is not clear how to determine the function f from the 50 training
> cel files. Method 1 is easy to implement, but it is not truly blind,
> since the normalization of the training cel files actually uses
> information from the test cel files.
> Could anybody tell me how to determine the function f from the 50
> training cel files?
>
> Many thanks, James
I would say that it depends on how you plan to use the classification
function.
If, in future, you will collect more samples, and use the
classification function to classify them, then you need to normalize
the test set the same way you will normalize the new arrays.
How you plan to do this may also affect how you normalize the training
set.
--Naomi
At 02:53 PM 12/17/2006, Wolfgang Huber wrote:
>[...]
>_______________________________________________
>Bioconductor mailing list
>Bioconductor at stat.math.ethz.ch
>https://stat.ethz.ch/mailman/listinfo/bioconductor
>Search the archives:
>http://news.gmane.org/gmane.science.biology.informatics.conductor
Naomi S. Altman 814-865-3791 (voice)
Associate Professor
Dept. of Statistics 814-863-7114 (fax)
Penn State University 814-865-1348 (Statistics)
University Park, PA 16802-2111
hi james,
briefly, to make new chips comparable to a training data set normalized
with RMA you can do the following:
normalize your training arrays keeping track of:
(1) the means over the ranks used in quantile normalization
(2) the probe effects estimated by the median polish procedure
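A minimal sketch of what could be stored from the training arrays, in
Python with NumPy (an assumption; the thread itself shows no code). The
names reference_quantiles and median_polish are hypothetical helpers
and the toy matrix is made up. A real RMA run would take the rank means
on background-corrected PM intensities and fit the median polish on
their log2 values, one probe set at a time.

```python
import numpy as np

def reference_quantiles(train):
    """(1) Mean over the training arrays at each rank (train: probes x arrays)."""
    return np.sort(train, axis=0).mean(axis=1)

def median_polish(y, n_iter=10, tol=1e-8):
    """(2) Tukey median polish of one probe set (y: probes x arrays, log2 scale).

    Decomposes y = overall + probe effect (row) + chip effect (column)
    + residual; probe and chip effects are re-centered to median zero.
    """
    resid = np.array(y, dtype=float)
    overall = 0.0
    row = np.zeros(resid.shape[0])  # probe effects
    col = np.zeros(resid.shape[1])  # chip effects
    for _ in range(n_iter):
        # sweep out row medians, then re-center the chip effects
        row_med = np.median(resid, axis=1)
        resid -= row_med[:, None]
        row += row_med
        delta = np.median(col)
        col -= delta
        overall += delta
        # sweep out column medians, then re-center the probe effects
        col_med = np.median(resid, axis=0)
        resid -= col_med[None, :]
        col += col_med
        delta = np.median(row)
        row -= delta
        overall += delta
        if np.abs(row_med).max() < tol and np.abs(col_med).max() < tol:
            break
    return overall, row, col, resid

# toy probe set: 5 probes x 3 training arrays, exactly additive (made-up data)
probe = np.array([-2.0, -1.0, 0.0, 1.0, 3.0])
chip = np.array([-1.0, 0.0, 2.0])
y = 8.0 + probe[:, None] + chip[None, :]
overall, probe_effects, chip_effects, resid = median_polish(y)
print(probe_effects)  # the values to keep for future arrays
```

Only the rank means and the per-probe effects need to be saved; the
chip effects of the training arrays are the usual RMA expression values.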
as the background correction is performed chip-by-chip, you can
transform each test (future) array to be compatible with the training
arrays (and the classifier) using the above information. f() then works
roughly like this:
* substitute the (ranked) test-expression values by the means over the
ranks from (1) (you're normalized now)
* calculate a chip effect for each probe set by subtracting the probe
effects from (2) from the probes in that probe set (you're done now)
i can send you the code for the above, in case you are interested.
all the best,
dennis
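The two steps of f() described above can be sketched as follows, in
Python with NumPy. This is an illustration under stated assumptions and
not Dennis's code: normalize_to_reference and chip_effect are
hypothetical names, the numbers are made up, the new array must have as
many probes as the stored rank means, and summarizing the corrected
probes by their median is an assumption made here.

```python
import numpy as np

def normalize_to_reference(new_array, ref_quantiles):
    """Replace each value by the stored training rank mean of the same rank."""
    ranks = np.argsort(np.argsort(new_array))  # 0 = smallest value
    return ref_quantiles[ranks]

def chip_effect(log_probes, probe_effects):
    """Expression of one probe set on one array.

    Subtracts the stored probe effects and summarizes the corrected
    probes by their median (assumption: median as the summary).
    """
    return np.median(log_probes - probe_effects)

# stored from training (made-up values): rank means and probe effects
ref_quantiles = np.array([10.0, 20.0, 40.0, 80.0])
probe_effects = np.array([1.0, -1.0, 2.0, 0.0])

# one new, background-corrected array whose 4 probes form a single probe set
new_array = np.array([300.0, 50.0, 900.0, 120.0])

normalized = normalize_to_reference(new_array, ref_quantiles)
expression = chip_effect(np.log2(normalized), probe_effects)
print(normalized, expression)
```

Nothing in this transform depends on other test arrays, so each new
array can be processed on its own.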
Hi James,
Can I point you in the direction of the RefPlus package available in
Bioconductor release 2.0, which will do what I think you are looking
for, i.e. allowing additional cel files to be added into a data set
without affecting the gene expression or the normalisation parameters
calculated from the previously processed cel files.
You might also want to check out the paper from Darlene Goldstein in
Bioinformatics (2006 p2364-2372) which discusses similar algorithms.
All the best
Chris
Chris Harbron
Technical Lead Statistician,
AstraZeneca
Hi,
I have a question about RMA normalization. Since RMA is an
across-sample normalization, suppose I have 50 training samples (cel
files) and 50 test samples (cel files). There are two ways to perform
normalization:
1. Combine all 100 samples and normalize them together with RMA. Then
train on the 50 training samples and classify the 50 test samples.
2. Run RMA on the 50 training samples only, so that each cel file is
converted to a gene expression vector. Suppose the mapping from cel
file to expression vector is Expression = f(cel), where the form of f
is determined by the 50 training cel files. Then apply the same
mapping to the test cel files.
I would think method 2 is more reasonable and truly blind. However,
it is not clear how to determine the function f from the 50 training
cel files. Method 1 is easy to implement, but it is not truly blind,
since the normalization of the training cel files actually uses
information from the test cel files.
Could anybody tell me how to determine the function f from the 50
training cel files?
Many thanks,
James