Question: distribution of agilent array data.
1
5.7 years ago by
shao80
Germany
shao80 wrote:
Hi everyone,

I am confused by the histogram of normalized Agilent microarray data. The data set comes from a human single-color array and contains around 700 microarrays and 43K probes.

After normalization, I plotted the expression values of all probes in a single microarray; one example is attached.

I expected to see a more or less symmetric distribution; however, the values seem truncated. At first I thought it might be related to the offset value, but I have tried different values (16, 1, 0) and still got a similar distribution.

Any explanation or suggestions?

Here is the code for normalization:

    library(limma)
    targets <- readTargets("targets.txt")
    # single-channel (green only) Agilent data
    x <- read.maimages(targets, source="agilent", green.only=TRUE)
    # normexp background correction
    y.bg <- backgroundCorrect(x, method="normexp")
    # quantile normalization between arrays
    y.bgn <- normalizeBetweenArrays(y.bg, method="quantile")
    # average replicate probes that share a ProbeName
    g.ex <- avereps(y.bgn, ID=y.bgn$genes$ProbeName)
    da.norm <- g.ex$E

Here is the R session info:

R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] graphics grDevices utils datasets stats methods base

other attached packages:
[1] ggplot2_0.9.3.1 reshape2_1.2.2 plyr_1.8

loaded via a namespace (and not attached):
[1] colorspace_1.2-4 dichromat_2.0-0 digest_0.6.3 grid_3.0.2 gtable_0.1.2 labeling_0.2
[7] MASS_7.3-29 munsell_0.4.2 proto_0.3-10 RColorBrewer_1.0-5 scales_0.2.3 stringr_0.6.2

Best,

chunxuan

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Rplot.pdf
Type: application/pdf
Size: 83343 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/bioconductor/attachments/20131110/f165edfb/attachment.pdf>

Answer: distribution of agilent array data.
0
5.7 years ago by
United States
James W. MacDonald50k wrote:

Hi Chunxuan,

On Sunday, November 10, 2013 3:19:44 PM, shao chunxuan wrote:
> I expected to see a more or less symmetric distribution; however, the
> values seem truncated. At first I thought it might be related to the
> offset value, but I have tried different values (16, 1, 0) and still got
> a similar distribution.

Why would you expect a symmetric distribution? Also, plotting a histogram with such large bin sizes isn't very helpful - I wouldn't be willing to say much about the distribution based on that plot anyway.

A more reasonable expectation is something like a convolution of a lognormal and an exponential distribution. In other words, there are likely a large number of genes that aren't expressed, and the distribution of those probes will be symmetrical around some small number. And the distribution of expressed genes is likely to be something like an exponential, with a long right tail. And since you used the normexp background correction, you made the same assumption as well.

Best,

Jim

> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
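The shape described in the reply above can be reproduced with a few lines of base R. This is a minimal sketch under stated assumptions, not the normexp model itself: it follows the verbal description (unexpressed probes scattered symmetrically around a low level, expressed probes with an exponential right tail on the log2 scale), and every parameter value (`frac.expressed`, `floor.level`, `noise.sd`, `tail.mean`) is made up for illustration rather than estimated from the data in this thread.

```r
set.seed(1)
n.probes <- 43000          # roughly the probe count mentioned in the question

# Hypothetical mixture parameters, picked for illustration only.
frac.expressed <- 0.4      # share of probes carrying real signal
floor.level <- 6           # log2 level around which unexpressed probes sit
noise.sd <- 0.4            # spread of the unexpressed probes
tail.mean <- 2             # mean of the exponential right tail

expressed <- runif(n.probes) < frac.expressed
log.expr <- ifelse(expressed,
                   floor.level + rexp(n.probes, rate = 1 / tail.mean),
                   rnorm(n.probes, mean = floor.level, sd = noise.sd))

# Same data, coarse versus fine binning (breaks is only a suggestion to hist()).
h.coarse <- hist(log.expr, breaks = 10, plot = FALSE)
h.fine <- hist(log.expr, breaks = 100, plot = FALSE)

# Right skew shows up as mean > median; some mass still lies left of the peak.
skew.gap <- mean(log.expr) - median(log.expr)
left.mass <- mean(log.expr < floor.level)

if (interactive()) {
  par(mfrow = c(1, 2))
  plot(h.coarse, main = "about 10 bins")
  plot(h.fine, main = "about 100 bins")
}
```

With only about ten bins the values left of the mode tend to be swallowed by the first one or two bars, which can make the histogram look cut off; the finer binning makes that left-hand mass visible.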
Hi Jim,

Thanks for the helpful comments.

I asked this question partly because the data have also been normalized by other people, and in that version the distribution is more or less symmetric.

The alternative code is:

    library(limma)
    targets <- readTargets("targets.txt")
    x <- read.maimages(targets, source="agilent",
                       columns=list(R="gMedianSignal", Rb="gBGMedianSignal",
                                    G="gMedianSignal", Gb="gBGMedianSignal"))
    y.bg <- backgroundCorrect(x, method="normexp")
    eset <- y.bg$G
    ## log2 transform before normalization!!!
    eset.l <- round(log2(eset), 4)
    y.bgn.l.2 <- normalizeBetweenArrays(eset.l, method="quantile")

I got a bell-shaped distribution for all probes in a single array when the data were log2-transformed before normalization. Whether to log2-transform first is an old question, but in my data it doesn't matter: the boxplots for a single gene across patients are identical, and the signature genes separate the patients very well under both approaches.

Best,

Chunxuan

> Date: Sun, 10 Nov 2013 19:45:43 -0500
> From: jmacdon@uw.edu
> To: hibergo@outlook.com
> CC: bioconductor@r-project.org
> Subject: Re: [BioC] distribution of agilent array data.
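The question of taking log2 before or after quantile normalization can be explored on simulated data. The sketch below is an assumption-laden toy: `quantileNormalizeSketch` is a hypothetical, simplified stand-in written for this example (it ignores ties and missing values) and is not limma's implementation, and the raw intensities are random numbers rather than Agilent data. It does not reproduce the two pipelines in this thread, which also differ in their `read.maimages` arguments.

```r
set.seed(2)
n.probes <- 1000
n.arrays <- 5

# Hypothetical positive raw intensities with a different scale per array.
raw <- matrix(2^rnorm(n.probes * n.arrays, mean = 8, sd = 2), ncol = n.arrays)
raw <- sweep(raw, 2, c(1, 1.5, 0.8, 2, 1.2), FUN = "*")

# Simplified quantile normalization: every array receives the same set of
# values, namely the across-array mean of each order statistic.
quantileNormalizeSketch <- function(m) {
  ranks <- apply(m, 2, rank, ties.method = "first")
  target <- rowMeans(apply(m, 2, sort))
  apply(ranks, 2, function(r) target[r])
}

log.first <- quantileNormalizeSketch(log2(raw))   # log2, then normalize
log.last <- log2(quantileNormalizeSketch(raw))    # normalize, then log2

# Within each array the probe ordering is the same under both routes.
same.order <- all(apply(log.first, 2, order) == apply(log.last, 2, order))

# The values differ only by the gap between log2(mean) and mean(log2).
max.diff <- max(abs(log.first - log.last))
print(max.diff)
```

In this toy setting both routes keep the same within-array ranking and give every array an identical distribution; the values themselves differ because the mean of the logs is never larger than the log of the mean.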
Hello Chunxuan,

I have worked with Agilent arrays a lot and can confirm Jim's comment: the type of distribution you show (with a heavy right tail) is fairly typical. If you follow Jim's advice and use smaller bin sizes/more bins (nclass = 100 or so), you will probably see that there is some mass of the distribution to the left of the peak/mode (as you would expect from normexp).

I guess what might be confusing you is that normalisation + logging is supposed to give you normally distributed data (or at least something not so very far away from it), which are symmetrically distributed. But this is a statement about the distribution of the replicates WITHIN genes, not across genes.

Best Wishes

Claus

Dr. Claus-D. Mayer
Biomathematics & Statistics Scotland (BioSS)
Rowett Institute of Nutrition and Health
University of Aberdeen
Aberdeen AB21 9SB, Scotland, UK.
email: claus at bioss.ac.uk or c.mayer at abdn.ac.uk
Telephone: +44 (0) 1224 438652

Biomathematics and Statistics Scotland (BioSS) is formally part of The James Hutton Institute, a registered Scottish charity No. SC041796 and a company limited by guarantee No. SC374831

> -----Original Message-----
> From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of shao chunxuan
> Sent: 11 November 2013 15:52
> To: James W. MacDonald
> Cc: bioconductor
> Subject: Re: [BioC] distribution of agilent array data.

The University of Aberdeen is a charity registered in Scotland, No SC013683.
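The within-gene versus across-gene distinction made in this reply can be checked on simulated data. A minimal sketch, assuming a made-up model in which gene-level log2 means follow a right-skewed (shifted exponential) distribution and replicate measurements scatter normally around each gene's mean; the helper `skewness` and all parameter values are hypothetical choices for this example.

```r
set.seed(42)
n.genes <- 5000
n.samples <- 100

# Hypothetical gene-level means on the log2 scale: right-skewed across genes.
gene.mean <- 6 + rexp(n.genes, rate = 1 / 1.5)

# Rows are genes, columns are samples; replicate noise is normal within a gene.
# gene.mean is recycled down each column, so row i is centred on gene.mean[i].
expr <- gene.mean + matrix(rnorm(n.genes * n.samples, sd = 0.3), nrow = n.genes)

# Moment-based sample skewness; near 0 for a symmetric distribution.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Across genes: all probes of one array (one column).
across.genes <- skewness(expr[, 1])

# Within genes: residuals after removing each gene's own mean, pooled.
within.genes <- skewness(expr - rowMeans(expr))

print(c(across = across.genes, within = within.genes))
```

Under these assumptions the histogram of a single array stays strongly right-skewed even though each gene's values across samples are symmetric, which is the situation described above.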
Answer: distribution of agilent array data.
0
5.7 years ago by
shao chunxuan70 wrote:
Hi Mayer,

Thanks for your reply. I fully agree: the bell-shaped curve is observed for a single gene across samples.

Best,

> From: c.mayer@abdn.ac.uk
> To: hibergo@outlook.com; jmacdon@uw.edu
> CC: bioconductor@r-project.org
> Subject: RE: [BioC] distribution of agilent array data.
> Date: Mon, 11 Nov 2013 18:04:31 +0000