Distribution of Agilent array data
shao ▴ 100
@shao-6241
Last seen 6.3 years ago
Germany
Hi everyone,

I am confused by the histogram of normalized Agilent microarray data. It is a human single-color array data set, containing around 700 microarrays and 43K probes.

After normalization, I plotted the expression values of all probes in a single microarray; one example is attached. I expected to see a more or less symmetric distribution; however, the values seem truncated. At first I thought it might be related to the offset value, but I have tried different values (16, 1, 0) and still got a similar distribution.

Any explanation or suggestions?

Here is the code for normalization:

    library(limma)
    targets <- readTargets("targets.txt")
    x <- read.maimages(targets, source="agilent", green.only=TRUE)
    y.bg <- backgroundCorrect(x, method="normexp")
    y.bgn <- normalizeBetweenArrays(y.bg, method="quantile")
    g.ex <- avereps(y.bgn, ID=y.bgn$genes$ProbeName)
    da.norm <- g.ex$E

Here is the R session info:

    R version 3.0.2 (2013-09-25)
    Platform: x86_64-apple-darwin10.8.0 (64-bit)

    locale:
    [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

    attached base packages:
    [1] graphics  grDevices utils     datasets  stats     methods   base

    other attached packages:
    [1] ggplot2_0.9.3.1 reshape2_1.2.2  plyr_1.8

    loaded via a namespace (and not attached):
    [1] colorspace_1.2-4   dichromat_2.0-0    digest_0.6.3       grid_3.0.2         gtable_0.1.2       labeling_0.2
    [7] MASS_7.3-29        munsell_0.4.2      proto_0.3-10       RColorBrewer_1.0-5 scales_0.2.3       stringr_0.6.2

Best,
chunxuan

Attachment: Rplot.pdf (application/pdf, 83343 bytes), scrubbed from the mailing-list archive. URL: https://stat.ethz.ch/pipermail/bioconductor/attachments/20131110/f165edfb/attachment.pdf
Microarray Normalization • 1.3k views
@james-w-macdonald-5106
Last seen 2 hours ago
United States
Hi Chunxuan,

Why would you expect a symmetric distribution? Also, plotting a histogram with such large bin sizes isn't very helpful; I wouldn't be willing to say much about the distribution based on that plot anyway.

A more reasonable expectation is something like a convolution of a lognormal and an exponential distribution. In other words, there are likely a large number of genes that aren't expressed, and the distribution of those probes will be symmetrical around some small number. The distribution of expressed genes is likely to be something like an exponential, with a long right tail. And since you used the normexp background correction, you made the same assumption as well.

Best,
Jim

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
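The shape Jim describes can be reproduced with a small simulation. This is only a sketch, not anything run on the poster's data: the probe count comes from the question, but the fraction of expressed probes, the background mean and spread, and the exponential signal scale are all made-up values, and the background is taken as normal on the raw intensity scale. All variable names are hypothetical.

```r
# Illustrative simulation; every parameter value below is an assumption.
set.seed(1)
n_probes <- 43000                    # roughly the probe count mentioned in the question

# Assumed: about 40% of probes measure an expressed gene.
expressed <- runif(n_probes) < 0.4

# Symmetric background noise on the raw scale (assumed mean and sd).
background <- rnorm(n_probes, mean = 100, sd = 15)

# Exponential signal for expressed probes only (assumed mean of 2000).
signal <- ifelse(expressed, rexp(n_probes, rate = 1 / 2000), 0)

# Observed intensity is background plus signal; guard against values <= 0.
intensity <- pmax(background + signal, 1)
log_intensity <- log2(intensity)

# Moment-based skewness: positive means a longer right tail.
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3
sim_skew <- skewness(log_intensity)
print(sim_skew)

# Many narrow bins show the peak of unexpressed probes and the right tail.
hist(log_intensity, breaks = 100,
     main = "Simulated single-array log2 intensities",
     xlab = "log2 intensity")
```

With these assumed settings the unexpressed probes pile up in a narrow, roughly symmetric peak and the expressed probes form the long right tail, so the histogram across all probes of one array is not expected to look bell-shaped.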
Hi Jim,

Thanks for the helpful comments. I asked this question partly because the data have also been normalized by other people, and in their version the distribution is more or less symmetric. The alternative code is:

    library(limma)
    targets <- readTargets("targets.txt")
    x <- read.maimages(targets, source="agilent",
                       columns=list(R="gMedianSignal", Rb="gBGMedianSignal",
                                    G="gMedianSignal", Gb="gBGMedianSignal"))
    y.bg <- backgroundCorrect(x, method="normexp")
    eset <- y.bg$G
    ## log2 transform before normalization!!!
    eset.l <- round(log2(eset), 4)
    y.bgn.l.2 <- normalizeBetweenArrays(eset.l, method="quantile")

I get a bell-shaped distribution for all probes in a single array if the data are log2-transformed before normalization. Whether to log2-transform first is an old question, but in my data it doesn't matter: I found that the boxplots for a single gene across patients are identical, and the signature genes separate the patients very well under both approaches.

Best,
Chunxuan
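The remark that the order of log2 transformation and quantile normalization makes little practical difference can be checked on toy data. The sketch below is an assumption-laden illustration: it uses simulated intensities and a hand-written base-R quantile normalization (the hypothetical helper `quantile_normalize`), not limma's `normalizeBetweenArrays`, and the array scale factors are invented.

```r
# Toy comparison of "normalize then log2" versus "log2 then normalize".
set.seed(3)
n_probes <- 5000
n_arrays <- 6
array_scale <- c(0.8, 1.0, 1.2, 0.9, 1.1, 1.3)   # assumed per-array intensity scaling

# Simulated positive raw intensities, one column per array.
raw <- matrix(rexp(n_probes * n_arrays, rate = 1 / 500) + 50, nrow = n_probes)
raw <- sweep(raw, 2, array_scale, "*")

# Minimal quantile normalization: every column is mapped onto the
# mean of the sorted columns, keeping each probe's rank within its array.
quantile_normalize <- function(m) {
  ranks  <- apply(m, 2, rank, ties.method = "first")
  target <- rowMeans(apply(m, 2, sort))
  apply(ranks, 2, function(r) target[r])
}

norm_then_log <- log2(quantile_normalize(raw))   # normalize on the raw scale, then log2
log_then_norm <- quantile_normalize(log2(raw))   # log2 first, then normalize

# The two versions differ only in the target distribution
# (log of an arithmetic mean versus mean of logs).
max_diff  <- max(abs(norm_then_log - log_then_norm))
agreement <- cor(as.vector(norm_then_log), as.vector(log_then_norm))
print(c(max_diff = max_diff, correlation = agreement))
```

Because log2 is monotone, the within-array ranks are the same either way, so only the shared target distribution changes slightly. A large change in the shape of the per-array histogram between two pipelines is therefore more likely to come from other differences between them (for example which columns are read or how background correction is applied) than from the order of these two steps, though that is a conjecture rather than something established in this thread.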
Hello Chunxuan,

I have worked with Agilent arrays a lot and can confirm Jim's comment: the type of distribution you show (with a heavy right tail) is fairly typical. If you follow Jim's advice of smaller bin sizes / more bins (nclass = 100 or so), you will probably see that there is some mass of the distribution to the left of the peak/mode (as you would expect from normexp).

I guess what might be confusing you is that normalisation plus logging is supposed to give you normally distributed data (or at least something not so very far from it), which are symmetrically distributed. But this is a statement about the distribution of the replicates WITHIN genes, not across genes.

Best wishes,
Claus

Dr. Claus-D. Mayer
Biomathematics & Statistics Scotland (BioSS)
Rowett Institute of Nutrition and Health
University of Aberdeen
Aberdeen AB21 9SB, Scotland, UK.
email: claus at bioss.ac.uk or c.mayer at abdn.ac.uk
Telephone: +44 (0) 1224 438652
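Both of Claus's points, the effect of the number of bins and the difference between the within-gene and across-gene distributions, can be illustrated on simulated data. This is a sketch under stated assumptions: gene-level means are drawn from a shifted exponential and replicate noise is normal, with arbitrary parameter values; none of it is taken from the poster's arrays, and the variable names are hypothetical.

```r
# Simulated log2 expression matrix: rows are genes, columns are arrays.
set.seed(2)
n_genes  <- 2000
n_arrays <- 50

# Assumed: gene-level means are right-skewed across genes.
gene_means <- 6 + rexp(n_genes, rate = 0.5)

# Assumed: replicate noise within a gene is normal with sd 0.3.
# gene_means is recycled down the columns, so row i is centred on gene_means[i].
expr <- gene_means + matrix(rnorm(n_genes * n_arrays, sd = 0.3), nrow = n_genes)

# All probes of a single array, as in the histogram from the question.
one_array <- expr[, 1]

# nclass is only a suggestion to hist(), but more classes give finer bins.
coarse <- hist(one_array, nclass = 10,  plot = FALSE)
fine   <- hist(one_array, nclass = 100, plot = FALSE)
print(c(coarse_bins = length(coarse$counts), fine_bins = length(fine$counts)))

# Moment-based skewness: near 0 for symmetric data, positive for a right tail.
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

# Across genes within one array: inherits the skew of the gene means.
across_gene_skew <- skewness(one_array)

# Within genes across arrays: centre each gene, then pool the residuals.
within_gene_skew <- skewness(as.vector(expr - rowMeans(expr)))
print(c(across = across_gene_skew, within = within_gene_skew))

hist(one_array, nclass = 100,
     main = "One simulated array, all genes", xlab = "log2 expression")
```

In this toy setup the pooled within-gene residuals are close to symmetric, while the values across genes in a single array stay clearly right-skewed, which is the distinction drawn in the reply above.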
ccshao ▴ 70
@shao-chunxuan-6243
Last seen 4 weeks ago
Germany
Hi Mayer,

Thanks for your reply. I fully agree that the bell-shaped curve is observed for a single gene across samples.

Best,