RMA-bimodality:

noel0925@sbcglobal.net ▴ 90


In the paper "Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data", the following statement about the bimodality of log2(PM) values and RMA background-corrected PM values can be found: "The same bimodal effect is seen when we stratify by log2(PM), thus it is not an artifact of conditioning on sums." (p. 4). I am a little confused by this, as I thought that it was indeed an artifact of the convolution!

Clearly, the background-corrected intensity values are given by E(S | O), the conditional expectation of the signal given what we observe, where the observed intensity O = S + B is the sum (so its distribution is the convolution) of a normally distributed background B with mean mu and variance sigma^2 (B ~ N(μ, σ^2)) and an exponentially distributed signal S with mean alpha (S ~ exp(α)).

There have been several postings regarding this matter in the Bioconductor archives, and all seem to point to this. Have I misunderstood?

One post in particular was the following: https://stat.ethz.ch/pipermail/bioconductor/2004-August/005908.html (see the response from zwu at jhsph.edu below):

> The original question I got was about the bimodal distribution of gcrma result from probe intensities with unimodal distribution. My answer was that the "change" was not necessarily surprising.
>
> For example, when you have "true log signal" from a bimodal distribution
>
>     logS=c(rnorm(1000,3,1),rnorm(1000,8,2))
>     # You will see this has two peaks
>     par(mfrow=c(2,2))
>     plot(density(logS))
>     # if the background, log(non-specific binding), comes from
>     logB=rnorm(2000,6,1)
>     # then when you plot the density of the convolution in log scale,
>     plot(density(log(exp(logS)+exp(logB))))
>     # you see only one peak, and this would be "before gcrma".

This explanation made sense to me, but seems to contradict what is stated in the paper.

Also, can someone explain the difference between RMA background version 1 and version 2?

Best regards,
Noel

Tags: Normalization, probe

Written 17.9 years ago by noel0925@sbcglobal.net ▴ 90 • updated 17.9 years ago by Phguardiol@aol.com ▴ 720
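The conditional expectation E(S | O) described in the question can be sketched in a few lines of R. This is only an illustration under stated assumptions, not the code used by any package: the function name `bg_correct_sketch` and the parameter values are hypothetical, `alpha` is treated as the rate of the exponential signal (use 1/alpha if alpha is meant as the mean), and the background is taken as an untruncated normal, in which case S given O = o is a normal distribution with mean o - mu - alpha * sigma^2 and sd sigma, truncated to positive values. The expression printed in the paper may contain additional terms that this sketch leaves out.

```r
# Hypothetical helper, not the affy implementation.
# o:     observed intensities (signal + background)
# mu:    mean of the normal background
# sigma: standard deviation of the normal background
# alpha: rate of the exponential signal (assumption: mean signal = 1 / alpha)
bg_correct_sketch <- function(o, mu, sigma, alpha) {
  a <- o - mu - alpha * sigma^2
  # Mean of a N(a, sigma^2) distribution truncated to positive values
  a + sigma * dnorm(a / sigma) / pnorm(a / sigma)
}

# Made-up parameter values, for illustration only
o <- seq(50, 5000, by = 10)
corrected <- bg_correct_sketch(o, mu = 100, sigma = 20, alpha = 0.01)

# The corrected values are positive and increase with the observed intensity
plot(log2(o), log2(corrected), type = "l")
```

Under these assumptions the correction is a strictly increasing function of the observed intensity, which is the sense in which the background correction can be regarded as a monotone transformation.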

Wolfgang Huber ★ 13k


EMBL European Molecular Biology Laborat…

Hi,

I am surprised that anybody is surprised about the different number of modes ("peaks"): the number of modes of a distribution is not conserved under monotone transformations (such as the background correction in RMA); this simply follows from the chain rule.

See below for a simple example with some "mock" microarray intensities z, showing the density of the log-transformed values before and after a (primitive) background correction.

    set.seed(123)
    n = 100000
    # mock intensities: two log-normal signal populations on a constant background of 20
    z = 20 + exp(c(rnorm(n), 3+rnorm(n)))
    par(mfrow=c(1,2))
    plot(density(log2(z)))      # before background correction
    plot(density(log2(z-20)))   # after subtracting the background

Cheers
Wolfgang

Answered 17.9 years ago by Wolfgang Huber ★ 13k
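The chain-rule argument can be made concrete without simulation. For a monotone map y = g(x), the density of y is f_Y(y) = f_X(x) * |dx/dy|, so the derivative factor can create or remove peaks. The sketch below applies this to the mock example from the answer above, where x = log(signal) is an equal mixture of N(0, 1) and N(3, 1). The function names and the grid are choices made for this illustration only.

```r
# Density of x = log(signal): equal mixture of N(0, 1) and N(3, 1),
# matching the mock data z = 20 + exp(x) used in the thread.
f_x <- function(x) 0.5 * dnorm(x, 0, 1) + 0.5 * dnorm(x, 3, 1)

# Density of y = log2(z - 20) = x / log(2), evaluated at the y belonging to x.
# Change of variables: f_Y(y) = f_X(x) * |dx/dy|, with dx/dy = log(2).
dens_corrected <- function(x) f_x(x) * log(2)

# Density of y = log2(z) = log2(20 + exp(x)), evaluated at the y belonging to x.
# Here dx/dy = log(2) * (20 + exp(x)) / exp(x) = log(2) * (1 + 20 * exp(-x)).
dens_raw <- function(x) f_x(x) * log(2) * (1 + 20 * exp(-x))

# Count interior local maxima of a curve sampled on an increasing grid.
# Both maps from x to y are increasing, so peaks along x are peaks along y.
count_modes <- function(d) sum(diff(sign(diff(d))) == -2)

x <- seq(-6, 8, length.out = 5000)
modes_raw <- count_modes(dens_raw(x))
modes_corrected <- count_modes(dens_corrected(x))
```

With these exact densities, `modes_raw` is 1 (a sharp peak near log2(20) with a shoulder) and `modes_corrected` is 2, so in this example the number of modes changes purely because of the transformation.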

Wolfgang,

Thank you for your reply. Just so that I am clear: the point is that the bimodality is not an artifact of the convolution, but simply reflects the fact that the number of modes of a distribution is not conserved under monotone transformations. This is why the paper points to the fact that the histograms of log2(PM/MM) stratified by log2(PM) are bimodal, so bimodality is a more general property of the probe-level data. Please clarify if this is incorrect.

Thanks,
Noel

Replied 17.9 years ago by noel0925@sbcglobal.net ▴ 90

Hi Noel,

> Just so that I am clear: the point is that the bimodality is not an artifact of the convolution, but simply reflects the fact that the number of modes of a distribution is not conserved under monotone transformations.

No, I did not say that, and I do not know how to understand this sentence, since "the convolution" is directly related to "the monotone transformation" that we are talking about.

> This is why the paper points to the fact that the histograms of log2(PM/MM) stratified by log2(PM) are bimodal

I leave the exegesis of the paper to its authors.

> so bimodality is a more general property of the probe-level data.

As you have just said yourself, the number of modes is not a property of the data, but of the data plus the particular (non-linear) transformation that you choose to apply to them.

Best wishes
Wolfgang

Replied 17.9 years ago by Wolfgang Huber ★ 13k

Hi Wolfgang (and everybody else)!

As you pointed out, there are two different issues here: a) the bimodality of (GC)RMA-normalized data on many chips (which I have now observed repeatedly as well), and b) the bimodality of log(PM/MM) values as stated in the Irizarry et al. paper.

In both cases the mathematical argument holds that any continuous distribution can be monotonically transformed into any other continuous distribution (which is basically what lies behind your statement that monotone transformations do not preserve the number of peaks/modes), but I still think that observation a), the bimodal distribution of GCRMA-normalized expression values, is worth discussing. Assuming GCRMA is a good/perfect normalisation method, the normalised values should relate directly to the "true" biological expression, and thus it is tempting to take such a histogram as an indication of there being two classes of genes: i) genes with no or low expression (forming the first peak), and ii) truly/highly expressed genes (forming the second peak). If, on the other hand, the bimodality is an implicit by-product of the GCRMA normalisation, it doesn't make sense to interpret the bimodality biologically in that way.

I have only limited experience with Affy arrays so far, but in at least one case the bimodality also occurred (though not as clearly) when using MAS5 instead of GCRMA, which I took as an indication that, in this case, GCRMA didn't create the two modes but just made it easier to distinguish between them. I would be interested to hear the experiences of others in this respect.

Best wishes
Claus

Replied 17.9 years ago by Claus Mayer ▴ 340

Hi all,

I almost always see the bimodality, and I think Claus has the right argument: the first peak represents genes that are not expressed or only very weakly expressed. Most Affy chips have fairly good coverage of the genome, and hence for almost any sample a good proportion of the genes will not be expressed. I find it to (GC)RMA's credit that it identifies the non-expressed genes so clearly. Additionally, I routinely filter out genes if they are called "Absent" on all arrays by the MAS 5 algorithm, and this greatly reduces the height of the first peak. Occasionally I have even filtered out genes unless they were "Present" on all arrays, just to see what would happen to the distribution, and lo and behold, the bimodality disappears! So rather than seeing the bimodality as a problem, I view it as accurately representing the real expression distribution.

Cheers,
Jenny

Replied 17.9 years ago by Jenny Drnevich ★ 2.2k
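The two filters described in the reply above (drop probe sets called "Absent" on all arrays, or keep only those called "Present" on all arrays) amount to simple row selections on a matrix of detection calls. The sketch below uses randomly generated placeholder matrices, so every name and number in it is hypothetical; in a real analysis the calls would come from a detection-call step such as `mas5calls()` in the affy package and the expression values from (GC)RMA.

```r
set.seed(1)
n_probesets <- 1000
n_arrays <- 4

# Placeholder expression values (rows = probe sets, columns = arrays)
expr <- matrix(rnorm(n_probesets * n_arrays, mean = 7, sd = 2),
               nrow = n_probesets)

# Placeholder detection calls: "P" (present), "M" (marginal), "A" (absent)
calls <- matrix(sample(c("P", "M", "A"), n_probesets * n_arrays,
                       replace = TRUE, prob = c(0.5, 0.05, 0.45)),
                nrow = n_probesets)

# Filter 1: drop probe sets that are called absent on every array
absent_everywhere <- rowSums(calls == "A") == ncol(calls)
expr_not_all_absent <- expr[!absent_everywhere, , drop = FALSE]

# Filter 2: keep only probe sets that are called present on every array
present_everywhere <- rowSums(calls == "P") == ncol(calls)
expr_all_present <- expr[present_everywhere, , drop = FALSE]

# Compare the densities of the first array before and after filtering
par(mfrow = c(1, 3))
plot(density(expr[, 1]), main = "all probe sets")
plot(density(expr_not_all_absent[, 1]), main = "not absent everywhere")
plot(density(expr_all_present[, 1]), main = "present everywhere")
```

Because the placeholder values are unrelated to the placeholder calls, the three densities here look alike; the code only shows the mechanics of the filtering, not the effect reported in the reply.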

Hi Claus,

Regarding a), I think it is not helpful to talk about the number of peaks in the distribution of microarray data, normalized or unnormalized, unless you are very precise about what transformation, pre-processing, scanner settings, etc. you apply. That is, that number is unlikely to be of any biological significance. See below.

b) is an entirely different issue; it is simply an artifact of the way MMs are defined, but (GC)RMA does not use PM/MM ratios.

Please try out the example I gave; it is stronger than just the trivial observation that any distribution can be mapped into any other through a suitable non-linear mapping:

    set.seed(123)
    n = 100000
    z = 20 + exp(c(rnorm(n), 3+rnorm(n)))
    par(mfrow=c(1,3))
    plot(density(log2(z)))
    plot(density(log2(z-20)))
    # x is taken on the log2 scale here, so that 2^x covers the range of z
    x = seq(min(log2(z)), max(log2(z)), length=100)
    plot(x, log2(2^x-20), type="l")

The function x -> log2(2^x-20) is concave and looks quite "well-behaved". The densities of log2(z) and log2(z-20) look quite different, and similar effects might result from subtly different preprocessing strategies, different scanner settings, etc.

Best wishes
Wolfgang

Replied 17.9 years ago by Wolfgang Huber ★ 13k
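The concavity claimed in the reply above can be checked with a short derivation. Writing g for the map from log2(z) to log2(z - 20), which is defined for x > log2(20) (the symbol g is introduced here only for this note):

```latex
\begin{align*}
g(x)   &= \log_2\!\left(2^x - 20\right), \qquad x > \log_2 20, \\
g'(x)  &= \frac{2^x}{2^x - 20} = 1 + \frac{20}{2^x - 20} > 1, \\
g''(x) &= -\frac{20 \,\ln 2 \cdot 2^x}{\left(2^x - 20\right)^2} < 0 .
\end{align*}
```

So g is increasing and concave: its slope is very large just above log2(20) and approaches 1 for large x. Values close to the background level are therefore spread out strongly, while large values are left almost unchanged.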

peter.warren@verizon.net ▴ 40


Hi, Wolfgang, Noel,

It is true that a non-linear transformation can change the number of modes of the data, and that such a transformation can be sufficient to explain the bimodality we see in background-corrected data. However, in my experience, the raw probe-level data is itself bimodal. When there is some real signal present, the probe-level intensities actually come from two different distributions. The first ("absent") is where there is no positive transcript binding, only cross-hybridization, non-specific binding, and background. The second ("present") is all that, plus true target transcript binding. This bimodality is more evident with log-transformed values. (In contrast, a log-transformation of a truly unimodal distribution, such as density(rnorm(...)), is still unimodal.) In every case I've looked at, the "absent" distribution dwarfs the "present" one, so it often looks like one mode before log transformation. After log transformation, I have been unable to model the data successfully with a single distribution; it always takes two.

Regards,
- Peter Warren

Answered 17.9 years ago by peter.warren@verizon.net ▴ 40

Hi Peter,

- Doesn't the distribution of mRNA abundances (i.e. physical concentrations, measured e.g. in average number of molecules per cell) span the whole range from just undetectably above zero to very large? I am not sure what mechanism would then result in two distinct peaks of fluorescence, one for "absent" and one for "present" mRNAs.

- I tried to find a definition of "truly unimodal distributions" (and, I suppose, "falsely unimodal distributions"), but couldn't find one. Can you advise?

Cheers
Wolfgang

Replied 17.9 years ago by Wolfgang Huber ★ 13k

Hi, Wolfgang,

Yes, mRNA abundances do indeed span the whole range. What I meant was that all the intensity distributions I have observed seem to be poorly modeled by a single mathematical distribution (that is what I meant by my poor choice of the term "truly unimodal"). Rather, two overlaid (added) distributions seem to model the observed data better, with the first distribution (presumably "absent") spanning the lower part of the range, and the second ("present", presumably modeling the mRNA abundances) spanning the *entire* range, but with a higher mean. The lower distribution would represent the much larger set of probes whose intensities are due only to cross-hyb, NSB, and background, with no true target mRNA signal. Its density peak is therefore much higher than the other. Although there is significant overlap between the two, the two means are separated and distinct, so the sum is a bimodal distribution.

Again, this is simply based on my observations of log2-transformed intensity values. In fact, isn't it the main purpose of any of the intensity processing methods (MAS, RMA, GCRMA, etc.) to detect and increase the difference between the two distributions, so as to help distinguish signal from noise?

Regards,
- Peter

> -----Original Message-----
> From: Wolfgang Huber [mailto:huber at ebi.ac.uk]
> Sent: Tuesday, June 06, 2006 12:15 PM
> To: Peter G. Warren
> Cc: noel0925 at sbcglobal.net; bioconductor at stat.math.ethz.ch
> Subject: Re: [BioC] RMA-bimodality:
>
> Hi Peter,
>
> - doesn't the distribution of mRNA abundances (i.e. physical concentrations measured e.g. in average no. of molecules per cell) span the whole range from just undetectably above zero to very large? I am not sure what mechanism would then result in two distinct peaks of fluorescences, one for "absent" and one for "present" mRNAs.
>
> - I tried to find a definition of "truly unimodal distributions" (and I suppose, "falsely unimodal distributions"), but couldn't find one, can you advise?
>
> Cheers
> Wolfgang
>
> Peter G. Warren wrote:
> > Hi, Wolfgang, Noel,
> >
> > It is true that a non-linear transformation can change the number of modes of the data, and that that transformation can be sufficient to explain the bimodality we see in background-corrected data. However, in my experience, the raw probe-level data is itself bimodal. When there is some real signal present, the probe-level intensities are actually from two different distributions. The first ("absent") is where there is no positive transcript binding, only cross-hyb, non-specific binding, and background. The second ("present") is all that, plus true target transcript binding. This bimodality is more evident with log-transformed values. (In contrast, a log-transformation of a truly unimodal distribution, such as density(rnorm(...)), is still unimodal.) In every case I've looked at, the "absent" distribution dwarfs the "present" one, so it often looks like one mode, before log transformation. After log transformation, I have been unable to model the data successfully with a single distribution; it always takes two.
> >
> > Regards,
> > - Peter Warren
> >
> >> Hi Noel,
> >>
> >>> Just so that I am clear- the point is that the bimodality is not an artifact of the convolution, but simply the fact that the number of modes of a distribution is not conserved under monotonic transformations.
> >>
> >> No, I did not say that, and I do not know how to understand this sentence, since "the convolution" is directly related to "the monotonic transformation" that we are talking about.
> >>
> >>> This is why the paper points to the fact that the histograms of log2(PMs/MMs) stratified by log2(PMs) are bimodal
> >>
> >> I leave the exegesis of the paper to its authors.
> >>
> >>> -so bimodality is a more general property of the probe level data.
> >>
> >> As you have just said yourself, the number of modes is not a property of the data, but of the data plus the particular (non-linear) transformation that you choose to apply to them.
> >>
> >> Best wishes
> >> Wolfgang.
>
> --
> Wolfgang Huber EBI/EMBL Cambridge UK http://www.ebi.ac.uk/huber
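The claim above that log2 intensities are better described by two overlaid distributions than by one can be checked on simulated data. The following is only a minimal sketch in base R: the mixture weights, means, and standard deviations are assumptions chosen for illustration, not values estimated from any array, and `fit_two_normals` is a hypothetical helper written here, not a Bioconductor function. It fits a single normal and a two-component normal mixture (by a plain EM loop) and compares them by BIC, where lower is better.

```r
# Illustrative simulation only: all parameters below are assumptions.
set.seed(1)
n_absent  <- 8000   # probes with no target signal (assumed to be the majority)
n_present <- 2000   # probes with true target signal
x <- c(rnorm(n_absent,  mean = 6, sd = 0.5),
       rnorm(n_present, mean = 9, sd = 1.5))   # simulated log2 intensities

# Single normal fitted by maximum likelihood
mu_one <- mean(x)
sd_one <- sqrt(mean((x - mu_one)^2))
ll_one <- sum(dnorm(x, mu_one, sd_one, log = TRUE))

# Hypothetical helper: EM for a two-component normal mixture
fit_two_normals <- function(x, n_iter = 200) {
  p  <- 0.5                                   # weight of component 1
  mu <- unname(quantile(x, c(0.25, 0.75)))    # crude starting means
  s  <- rep(sd(x), 2)                         # crude starting SDs
  for (i in seq_len(n_iter)) {
    # E-step: responsibility of component 1 for each observation
    d1 <- p * dnorm(x, mu[1], s[1])
    d2 <- (1 - p) * dnorm(x, mu[2], s[2])
    r  <- d1 / (d1 + d2)
    # M-step: update weight, means, and SDs
    p  <- mean(r)
    mu <- c(sum(r * x) / sum(r), sum((1 - r) * x) / sum(1 - r))
    s  <- c(sqrt(sum(r * (x - mu[1])^2) / sum(r)),
            sqrt(sum((1 - r) * (x - mu[2])^2) / sum(1 - r)))
  }
  ll <- sum(log(p * dnorm(x, mu[1], s[1]) + (1 - p) * dnorm(x, mu[2], s[2])))
  list(p = p, mu = mu, sd = s, loglik = ll)
}

fit <- fit_two_normals(x)

# BIC with 2 free parameters (single normal) and 5 (two-component mixture)
bic_one <- -2 * ll_one     + 2 * log(length(x))
bic_two <- -2 * fit$loglik + 5 * log(length(x))
```

A better fit of the mixture on simulated data says nothing by itself about what the two components are biologically; it only shows how the "one distribution or two" question could be put to real log2 intensities.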

ADD REPLY • link 17.9 years ago peter.warren@verizon.net ▴ 40


Hi, Wolfgang,

Just a short follow-up. The example you provided to Noel has two distributions embedded within it. Your example is as follows:

    set.seed(123)
    n = 100000
    z = 20 + exp(c(rnorm(n), 3+rnorm(n)))
    par(mfrow=c(1,2))
    plot(density(log2(z)))
    plot(density(log2(z-20)))

Continuing, if we separate the two distributions in z and overlay them, the plot illustrates the point I've been attempting to make:

    z1 = 20+exp(rnorm(n))
    z2 = 20+exp(3+rnorm(n))
    # Note that c(z1, z2) has the same distribution as z
    # The left plot shows your bimodal combined density, before the "background correction", as before.
    plot(density(log2(z)))
    # The right plot shows the two component distributions separately.
    # xlim spans the full range so that the second component is not clipped.
    plot(density(log2(z1)), xlim=range(log2(z)))
    lines(density(log2(z2)))

Your example in fact models quite well what I see with real, uncorrected data. If the two distributions are not "absent" and "present" intensity distributions, I'm open to other suggestions.

Regards,
- Peter

ADD REPLY • link 17.9 years ago peter.warren@verizon.net ▴ 40


Phguardiol@aol.com ▴ 720

@phguardiolaolcom-152

Last seen 9.7 years ago

An embedded and charset-unspecified text was scrubbed...
Name: not available
Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20060606/f26425fe/attachment.pl

ADD COMMENT • link 17.9 years ago Phguardiol@aol.com ▴ 720


Hi Philip,

The bimodality we discuss here is ACROSS genes. The normality/unimodality that is assumed for parametric tests concerns the error distribution WITHIN the replicates for the same gene. In other words, the bimodality of GCRMA-normalized data across genes does not have any direct implications for whether a parametric or a non-parametric test is preferable.

Regards,
Claus

Phguardiol at aol.com wrote:
> Hi,
> these are comments/questions from a neophyte who asked questions about this one or two years ago on this list, and did not get a clear answer!
> Does this bimodal distribution observed with GCRMA argue against the use of parametric tests, and does it favor the use of non-parametric tests? Said differently, should this distribution affect our choice of statistical tests for subsequent analyses?

--
Dr Claus-D. Mayer | http://www.bioss.ac.uk
Biomathematics & Statistics Scotland | email: claus at bioss.ac.uk
Rowett Research Institute | Telephone: +44 (0) 1224 716652
Aberdeen AB21 9SB, Scotland, UK. | Fax: +44 (0) 1224 715349
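The across-genes versus within-gene distinction made in this reply can be illustrated with a small simulation. This is only a sketch under assumed values: the gene-level means, the replicate count, and the error SD are made up for illustration and do not come from GCRMA output. Gene means are drawn from a two-group mixture, so the pooled values are bimodal, while the replicate errors around each gene mean are normal by construction.

```r
# Illustrative simulation only: all parameters below are assumptions.
set.seed(42)
n_genes <- 2000
n_reps  <- 4

# Gene-level means: half "low" and half "high" genes (bimodal across genes)
gene_mean <- c(rnorm(n_genes / 2, mean = 3, sd = 0.7),
               rnorm(n_genes / 2, mean = 9, sd = 1.0))

# Replicate measurements: gene mean plus normal measurement error
expr <- matrix(gene_mean, nrow = n_genes, ncol = n_reps) +
  matrix(rnorm(n_genes * n_reps, sd = 0.3), nrow = n_genes)

# Within-gene residuals: each value minus its own gene's replicate mean
resid <- expr - rowMeans(expr)

par(mfrow = c(1, 2))
# Left: all values pooled across genes (two peaks)
plot(density(as.vector(expr)), main = "Across genes")
# Right: within-gene residuals (one peak centred at zero)
plot(density(as.vector(resid)), main = "Within-gene residuals")
```

A gene-wise test only depends on the right-hand distribution, which is why the shape of the left-hand one does not by itself decide between parametric and non-parametric tests.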

ADD REPLY • link 17.9 years ago Claus Mayer ▴ 340
