RE: RMA normalization
Mark Reimers ▴ 70
@mark-reimers-658
Last seen 10.3 years ago
Hello Hairong, Adai,

That suggestion was mine a few weeks ago. My current thinking is that we may reasonably expect different cell types to have different distributions of RNA abundances; as an extreme example, some cells specialize in making one protein for export. It therefore seems to me that our best shot is to make the raw data comparable within each cell type, and to make the different cell types comparable per identical weight of RNA (ideally we would find some way to normalize by the number of cells). Normalization within cell types might be done by quantiles; normalization across cell types by the simpler (robust) mean, until we can normalize by cells. Is there a better way?

In practice I find substantial differences when normalizing across different cell types, as opposed to normalizing within cell types separately. Does anyone else have experience with this?

Regards,
Mark Reimers

Date: Fri, 10 Sep 2004 15:56:00 +0100
From: Adaikalavan Ramasamy <ramasamy@cancer.org.uk>
Subject: RE: [BioC] RMA normalization
To: Hairong Wei <hwei@ms.soph.uab.edu>
Cc: BioConductor mailing list <bioconductor@stat.math.ethz.ch>

I was under the impression that getting sufficient mRNA from a single sample was difficult enough. Sorry, I do not think I can be of much help, as I have never encountered this sort of problem, perhaps due to my own inability to distinguish the terms mRNA, sample, and tissue. But there are many other people on the list who have a better appreciation of the biology, and hopefully one of them can advise you. Could you give us the link to the message you are talking about?

On Fri, 2004-09-10 at 15:26, Hairong Wei wrote:
> Dear Adai,
>
> Thanks for asking. I got this phrase from the messages stored in the archive yesterday. My understanding is that, suppose you have 100 arrays and 10 mRNA samples from 10 tissues, with each set of 10 arrays hybridized with mRNA from the same tissue. When you run the RMA algorithm, you run together only those arrays (10 at a time) that were hybridized with mRNA from the same tissue, rather than running all 100 arrays together. After running RMA for each tissue, scaling is applied across arrays from the different tissues.
>
> The reason for doing this is that it is not reasonable to assume that arrays from different tissues have the same distribution.
>
> What is your idea for doing background correction and normalization of 100 arrays across 10 tissues?
>
> Thank you very much in advance.
>
> Hairong Wei, Ph.D.
> Department of Biostatistics
> University of Alabama at Birmingham
> Phone: 205-975-7762

Mark Reimers, senior research fellow, National Cancer Inst. and SRA, 9000 Rockville Pike, bldg 37, room 5068, Bethesda MD 20892
Normalization • Cancer • 1.9k views
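A minimal sketch of the two-step scheme described in the post above: quantile-normalize the arrays within each cell type, then shift each cell type onto a common scale using a robust (trimmed) mean. The quantile normalization is written out by hand so it does not depend on any particular package, and expr.by.type is a hypothetical list of log2-intensity matrices (genes in rows, arrays in columns), one matrix per cell type.

## Sketch: normalize within cell types by quantiles, across cell types
## by a robust mean. 'expr.by.type' is a hypothetical list of log2
## intensity matrices, one matrix per cell type (genes x arrays).

quantile.normalize <- function(x) {
  ## replace each column by the mean sorted column, preserving
  ## every column's original rank order
  ranks  <- apply(x, 2, rank, ties.method = "first")
  target <- rowMeans(apply(x, 2, sort))
  apply(ranks, 2, function(r) target[r])
}

## step 1: make arrays comparable within each cell type
within.norm <- lapply(expr.by.type, quantile.normalize)

## step 2: shift each cell type so its robust (20%-trimmed) mean
## matches the overall robust mean
type.means   <- sapply(within.norm, function(x) mean(as.vector(x), trim = 0.2))
overall.mean <- mean(type.means)
across.norm  <- mapply(function(x, m) x - m + overall.mean,
                       within.norm, type.means, SIMPLIFY = FALSE)

This only puts the cell types on a common scale per (assumed identical) amount of RNA, as in the post; a per-cell normalization would need extra information, such as spike-in controls.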
Mark Dalphin ▴ 30
@mark-dalphin-571
Last seen 10.3 years ago
Our results match what you say, Mark: normalization across cell types by quantiles, etc., is problematic (at best) due to the different distributions of the RNA concentrations. In our hands, simulations suggest that when more than ~10% of the RNA species (randomly selected) change substantially in concentration, _all_ of our normalization methods go out of whack.

We _believe_ that externally spiked-in standards would permit normalization across multiple cell types. While we haven't had the chance to test external standards, for a variety of technical reasons (mostly finding a set that really shows no cross-hybridization, then preparing that set in sufficient quantity, and then convincing all the people who do the benchwork that the extra work is worthwhile), I have read that this too is no panacea. The major complaint I have read is that a mismatch in the amount of external standard will really damage the ability to normalize the experiment and leave no trace of that systematic error. It is apparently very tough to add the external standard in a reproducible manner (disclaimer: I have no bench experience with microarrays).

Mark Dalphin
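To make the "~10% of species changing" observation concrete, here is a toy illustration (my own sketch, not Mark Dalphin's simulation code): if a sizeable fraction of the transcripts goes up in one condition, forcing the two conditions to a common median (or common quantiles) drags the genuinely unchanged genes downward, and nothing in the data themselves reveals that bias.

## Toy illustration: a large, mostly one-sided change breaks scale normalization.
set.seed(1)
n.genes <- 10000
base    <- rnorm(n.genes, mean = 8, sd = 2)   # log2 expression, condition A
changed <- 1:3000                             # 30% of genes go up in condition B
treated <- base
treated[changed] <- treated[changed] + 2      # +2 log2 units (4-fold)

## median scaling: force both conditions to the same median
treated.scaled <- treated - median(treated) + median(base)

## the genes that did NOT change now look down-regulated
mean(treated.scaled[-changed] - base[-changed])   # clearly below 0

Spike-in controls avoid this in principle, but, as noted above, only if the amount of spike added is tightly controlled.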
@matthew-hannah-621
Last seen 10.3 years ago
Hi,

I've tried to get a discussion going on this several times but have had very few responses. I'm looking at some data where the treatment has a very BIG effect, but I don't think this is unusual; it's just that a lot of people don't realise it, or ignore it. If we take the average Pearson correlation of treated versus untreated as a crude indication of the number of changes (is this valid?), then in our experiments this is 0.96. Comparing 25 treated versus 25 untreated replicates (GCRMA, limma, gene-wise FDR-corrected p < 0.001), we get roughly 30% of the transcripts on the chip changing! Looking at a couple of public data sets, I don't think our treatment effect (as indicated by the Pearson correlation) is that unusual; it's just that we have the statistical power to detect the changes. Also, looking at the changes and considering the biology, it seems reasonable to get these changes.

In the discussions of RMA/GCRMA, two assumptions are discussed:
1) few genes changing - obviously not the case here;
2) equal numbers up and down - despite the huge number of changes there are only 20 more transcripts going up than down, so yes.

I've also looked at a number of control genes and can't find any real bias; in fact there is quite a bit of (random?) variation, so if you normalised on a few of these you might get strange results... Also, it depends what you are looking for. Among the 25 replicates we have different genotypes, and to look for differences there I GCRMA-normalise treated and untreated separately, but then don't make untreated-treated comparisons, only comparisons between genotypes.

Finally, I will at some point try separate GCRMAs and then scaling. If anyone has any scripts for mean, robust-mean or median scaling of a series of separate exprSets, I'd appreciate it.

Cheers,
Matt
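Not a polished script, but a sketch of the kind of thing requested above, assuming each group has already been pre-processed separately (e.g. by GCRMA) and you are left with a list of log2 expression matrices; for exprSet objects you would first extract the matrices with exprs(). The names mat.list and scale.sets are made up for illustration, and the scaling constant can be the median, the mean, or a trimmed "robust" mean.

## Sketch: put separately pre-processed expression matrices on a common scale.
## 'mat.list' is a hypothetical list of log2 expression matrices, e.g.
## mat.list <- lapply(eset.list, exprs) for a list of exprSets.

scale.sets <- function(mat.list, center = median) {
  centers <- sapply(mat.list, function(x) center(as.vector(x)))
  target  <- mean(centers)
  mapply(function(x, ctr) x - ctr + target,
         mat.list, centers, SIMPLIFY = FALSE)
}

scaled.median <- scale.sets(mat.list)                  # median scaling
scaled.robust <- scale.sets(mat.list,                  # robust-mean scaling
                            center = function(v) mean(v, trim = 0.2))

If the matrices share the same probe sets in the same order, do.call(cbind, scaled.median) reassembles them into a single matrix for downstream comparisons.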
@wolfgang-huber-3550
Last seen 3 months ago
EMBL European Molecular Biology Laborat…
On Wed, Sep 15, 2004 at 10:20:42AM +0200, Matthew Hannah wrote:
>> I've tried to get a discussion on this several times but have got very few responses.
>>
>> I'm looking at some data where the treatment has a very BIG effect, but I don't think this is unusual, it's just that a lot of people don't realise it or ignore it.

Hi Matthew,

Robert Gentleman and I just talked about the issue you brought up, and here are some of our points.

If by "big effect" you mean that a lot of genes change expression under different conditions, then I think that many people are aware of the issue, but it has not been widely discussed. Frank Holstege and his group in Utrecht have worked quite systematically on this, and I would recommend looking at one of their recent papers:

Monitoring global messenger RNA changes in externally controlled microarray experiments. van de Peppel J, Kemmeren P, van Bakel H, Radonjic M, van Leenen D, Holstege FC. EMBO Rep. 2003 Apr;4(4):387-93. PMID: 12671682

You seemed to suggest that having a large fraction of genes changing is a problem only for RMA and GCRMA, but our understanding is that it is a problem for all "intrinsic" methods of normalization, i.e. those that do not use external spike-in controls. We believe that the problem is largely to do with normalization, and not with computing expression estimates. As long as microarray experiments are not trying to measure absolute molecule abundances, but rather just "relative expression", the problem of interpreting situations in which most genes change will always remain hard. On the other hand, if you use an "extrinsic" method, you need to decide whether you want to measure the number of molecules per cell, or per total RNA, or per what ... so that is a conceptual issue that needs to be worked out. This is also known as a research *opportunity*.

>> If we take the average Pearson correlation of treated versus untreated as a crude indication of the number of changes (is this valid?) then in our experiments this is 0.96. Comparing 25 treated versus 25 untreated replicates (GCRMA, limma, gene-wise FDR-corrected p < 0.001) we get roughly 30% of transcripts on the chip changing!

Your description is too imprecise for us to comment on whether the method is valid, but why not use good old-fashioned statistics: for each gene, calculate a p-value from e.g. a t-test or an appropriate linear-model generalization, and look at the histogram of p-values. The empirical p-values at the right end of the histogram should be approximately uniformly distributed, and the number of non-differentially expressed genes can be estimated as 2 times the number of genes with p > 0.5.

>> Looking at a couple of public datasets I don't think our treatment effect (as indicated by the Pearson) is that unusual, it's just that we have the statistical power to detect the changes. Also looking at the changes, and considering the biology it seems reasonable to get these changes.

It depends a lot on what the factors are. Robert has some collaborators who use treatments that change things greatly, and others who use treatments so specific that the number of changed genes is under 10. No global statement can be made here. The former need to be warned that their experiment lies outside the currently available technology, and then you do the best you can. With the latter, we should be able to do a good job, although one may never find the signal if p-value corrections are applied in a naive fashion (but that is a different story).

>> In the discussions on RMA/GCRMA there are 2 assumptions discussed
>> 1) few genes changing - obviously not
>> 2) equal # up and down - despite the huge amount of changes there are only 20 more transcripts going up compared to down - so yes.

As I said above, I do not believe this is specific to RMA or GCRMA; it is a general problem for all normalization methods. Also, these are sufficient conditions, not necessary ones: if they are fulfilled, (GC)RMA can be guaranteed to work, but if they aren't, the results from (GC)RMA, or other normalization methods, may still be valid to a sufficient degree.

>> I've also looked at a number of control genes and can't find any real bias, in fact there is quite a bit of (random?) variation, so if you normalised on a few of these then you may get strange results...

I think the problem with using control genes is that there are typically few of them and they do not necessarily span the range of intensities, so they provide a poor basis for normalization. Although, in the present case, they may be better than the alternatives.

>> Finally, I will at some point try separate GCRMAs and then scaling. If anyone has any scripts for mean, robust mean or median scaling a series of separate exprSets then I'd appreciate it.

I hope that this is a rather simple use of lapply or similar (depending on how you have your exprSets stored).

Regards,
Robert, Wolfgang

Robert Gentleman, Associate Professor, Department of Biostatistics, Harvard School of Public Health, office: M1B20, phone: (617) 632-5250, fax: (617) 632-2444, email: rgentlem@jimmy.harvard.edu
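A small sketch of the p-value histogram diagnostic described above, using plain gene-wise t-tests; expr (a log2 expression matrix, genes by arrays) and group (a factor of treatment labels) are hypothetical objects, and a moderated test from limma would serve equally well.

## Sketch: estimate how many genes are NOT differentially expressed.
## 'expr'  : hypothetical log2 expression matrix, genes x arrays
## 'group' : factor with levels "treated" and "untreated", one per array

pvals <- apply(expr, 1, function(y)
  t.test(y[group == "treated"], y[group == "untreated"])$p.value)

hist(pvals, breaks = 50, main = "Gene-wise t-test p-values", xlab = "p-value")

## Non-changing genes have roughly uniform p-values, so about half of
## them land above 0.5; double that count to estimate their total number.
n.null.est <- 2 * sum(pvals > 0.5)
n.de.est   <- max(nrow(expr) - n.null.est, 0)   # rough count of changing genes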
On Wed, 2004-09-15 at 11:27, w.huber@dkfz-heidelberg.de wrote:
> As I said above, I do not believe this is specific to RMA or GCRMA, but rather a general problem for all normalization methods.

I hate to be pedantic, but really people should be careful about how they use the term "normalization". These comments are not aimed at any particular individual(s), but I note a trend, particularly among those who come to Affymetrix-style microarrays via the two-colour world, to use "normalization" to refer to the entire sequence of pre-processing steps, e.g. "I normalized my data using RMA", "I have GCRMA-normalized data", or "I used dChip normalization".

It is more precise to substitute some form of the term "pre-process" for "normalization" in the above. Also, perhaps it is better to talk about having "expression values": "RMA expression values", "MAS5 expression values", "dChip expression values", "GCRMA expression values", and so on.

Why is any of this important? Because normalization usually refers to something more specific. Many people like to think of the process of going from raw probe-intensity data to expression values as involving background adjustment, normalization and summarization steps. In this context "normalization" refers to the process of reducing unwanted technical variation. It so happens that in the case of RMA and GCRMA this procedure is quantile normalization.

Anyway, that is my opinion on the matter.

Ben

--
Ben Bolstad <bolstad@stat.berkeley.edu>
http://www.stat.berkeley.edu/~bolstad
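Ben's three-step decomposition is exactly how the affy package's expresso() interface is laid out; the call below is a sketch from memory of that interface (check ?expresso for the exact argument and method names), and abatch is a hypothetical AffyBatch read in with ReadAffy().

library(affy)

## background adjustment, normalization and summarization as separate choices
eset <- expresso(abatch,
                 bgcorrect.method = "rma",          # background adjustment
                 normalize.method = "quantiles",    # normalization proper
                 pmcorrect.method = "pmonly",       # PM correction
                 summary.method   = "medianpolish") # probe-set summarization

## the same pipeline in one step:
## eset <- rma(abatch)

In this vocabulary, "RMA expression values" names the output of the whole pipeline, while "quantile normalization" names only the second step.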
Hi Ben,

> I hate to be pedantic, but really people should be careful about how they utilize the term "normalization". ...

I fully agree with your posting. I didn't intend to equate RMA or GCRMA with just normalization, and I had hoped that everybody on this list was aware that Affymetrix preprocessing involves more than just "normalization". But those other aspects weren't the point of our posting, which applies equally to vsn, loess, whatever.

In fact I think "normalization" is not a very useful term at all: what we do there has nothing to do with the normal distribution, and I don't see what meaning the word root "normal" has in it. The word is often used in a muddled way to mean all sorts of pre-processing. But then, "pre-processing" is not a particularly precise or intuitive term either.

The problem I see is that the different aspects of pre-processing are not independent of each other; as soon as you start slicing up the problem of pre-processing into different sub-steps, that already involves approximations and presumptions about how to solve the problem. In the affy package / expresso method you (and Rafa & Laurent & others) have come up with a great and extremely useful way of slicing up the problem, but of course that's not the end of the story (as the continuing work on methods like affyPLM indicates, as I understand it).

Best wishes,
Wolfgang
Wolfgang,

w.huber@dkfz-heidelberg.de wrote:
> I fully agree with your posting. I didn't intend to equate RMA or GCRMA with just normalization, and I had hoped that everybody on this list was aware that Affymetrix preprocessing involves more than just "normalization". But those other aspects weren't the point of our posting, which applies equally to vsn, loess, whatever.
>
> In fact I think "normalization" is not a very useful term at all: what we do there has nothing to do with the normal distribution, and I don't see what meaning the word root "normal" has in it.

Regarding semantics (too), bioinformatics can be a real headache. The meaning of 'normal' in this context is not the one used in statistics (nor the one used in geometry). It might have more to do with the "normal" used by chemists, or geologists. The aim of this particular step is to 'reduce'/transform/(pre-)process the signal in such a way that effects like scanner settings, or differences in the amount of total labelled target, are corrected. As the problems become better understood (one of them being the case where a majority of the genes are suspected to be differentially expressed), this pre-processing step can consist in 'tweaking' the data with other explicit objectives in mind (variance stabilization being one example). As you say below, the word is becoming used to describe what is no longer necessarily the primary objective of the transformation.

> The word is often used in a muddled way to mean all sorts of pre-processing. But then, "pre-processing" is not a particularly precise or intuitive term either.

I like 'pre-processing'. Although it is not extremely precise, I find it reasonably intuitive: the prefix 'pre' indicates that this is something done early in the process...

> The problem I see is that the different aspects of pre-processing are not independent of each other; as soon as you start slicing up the problem of pre-processing into different sub-steps, that already involves approximations and presumptions about how to solve the problem. In the affy package / expresso method you (and Rafa & Laurent & others) have come up with a great and extremely useful way of slicing up the problem, but of course that's not the end of the story (as the continuing work on methods like affyPLM indicates).

An object model (in the computer sense) was proposed to let end-users perform pre-processing, and at the same time to let people interested in the methods explore new approaches to pre-processing. However, this is only a model, and it clearly has limitations... implementing the PDNN method in the affy package required more than the usual number of programming tricks (one of which was fishing a variable out of an enclosing frame using dynamic scoping). As new trends in pre-processing appear, we will see more clearly how to modify the object structure so that people can easily implement new approaches (we would also like people to implement their methods directly in Bioconductor, rather than us implementing them from their papers).

L.
