Question about quantile normalization and NA value


H@mamba.fhcrc.org ▴ 10

@hmambafhcrcorg-6345

Last seen 9.6 years ago

Dear all,

I have a question about quantile normalization and NA values.

I am going to normalize microarray data with "normalizeBetweenArrays", which is the quantile normalization function in the "limma" package. I normalized a data set containing an NA as follows:

> library(limma)
> x <- matrix(c(100,15,200,250,110,16.5,220,275,120,18,240,300),4,3)
> colnames(x) <- paste("Chip",1:3, sep="")
> rownames(x) <- c("RNA-A","RNA-B","RNA-C","RNA-D")
>
> x
      Chip1 Chip2 Chip3
RNA-A   100 110.0   120
RNA-B    15  16.5    18
RNA-C   200 220.0   240
RNA-D   250 275.0   300
>
> normalizeBetweenArrays(x)
      Chip1 Chip2 Chip3
RNA-A 110.0 110.0 110.0
RNA-B  16.5  16.5  16.5
RNA-C 220.0 220.0 220.0
RNA-D 275.0 275.0 275.0
>
> y <- x
> y[2,2] <- NA
>
> normalizeBetweenArrays(y)
          Chip1     Chip2     Chip3
RNA-A 134.44444  47.66667 134.44444
RNA-B  47.66667        NA  47.66667
RNA-C 226.11111 180.27778 226.11111
RNA-D 275.00000 275.00000 275.00000

The normalized y seems quite far from the normalized x. Can a single NA really have such a large effect? Should I normalize after replacing the NA with some value, such as median(x[2,],na.rm=TRUE)?

My environment is limma version 3.16.6, R version 3.0.1.

Thanks

--
Sent via the guest posting facility at bioconductor.org.

Microarray Normalization limma • 4.5k views

updated 10.3 years ago by godahajime ▴ 20 • written 10.3 years ago by H@mamba.fhcrc.org ▴ 10

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 14 months ago

United States

Hi,

On Tue, Jan 21, 2014 at 5:03 AM, <H at mamba.fhcrc.org> wrote:
> The normalized y seems quite far from the normalized x. Can a single NA really have such a large effect?

I suspect that this is only because you are doing the normalization over a very small dataset. With four observations per "array", 25% of your data on Chip2 is missing, so a change in a single data point has a larger effect than it would on your real array (which would have thousands of observations per array). Of course, if 25% of the values on one of your real arrays were NA, you might consider failing that array anyway ;-)

> Should I normalize after replacing NA with some value, such as median(x[2,],na.rm=T) ?

I'd think not. If you are analyzing a commercial array, just stick with the prescribed steps you find in some of the many tutorials available (in limma or other bioc tutorials). If you have a custom array, more care will be needed.

-steve

--
Steve Lianoglou
Computational Biologist
Genentech

written 10.3 years ago by Steve Lianoglou ★ 13k

godahajime ▴ 20

@godahajime-6349

Last seen 9.6 years ago

Dr Steve Lianoglou,

Thanks for your reply.

As you mentioned, the example is too small. That issue can be set aside, because the actual data set is over 2000 x 300.

I read the limma tutorial and the source code of "normalizeBetweenArrays"; however, I couldn't understand how NA values are processed. Could you show me the process?

Thanks,

written 10.3 years ago by godahajime ▴ 20

Hi,

On Wed, Jan 22, 2014 at 3:43 AM, godahajime <godahajime at zoho.com> wrote:
> I read the tutorial of limma and the source code of "normalizeBetweenArrays", however, I couldn't understand how NA values were processed.
> Could you show me the process?

They are handled "very carefully" ;-)

The function that actually does the quantile normalization is limma::normalizeQuantiles. If you *really* want to understand what is happening there, I suggest you:

(1) download the source code for limma

(2) open the limma/R/norm.R file and jump to the `normalizeQuantiles` function

(3) reconstruct the parameters required to run the function, i.e.:

    (a) Create a test matrix with some (5) data points missing:

        R> A <- matrix(rnorm(50), nrow=10)
        R> A[sample(50, 5)] <- NA

    (b) Create a `ties` variable:

        R> ties <- TRUE

(4) Now step through the code

As you step through the code, take a careful look at what each line produces. You will likely get tripped up by some of the code there, but read the documentation (I'm sure you will have to read ?approx, for instance).

If you really care to know how NAs are accounted for, that's how you would go about doing it. Others are happy enough to know that they are more or less ignored and accounted for, and that's that.

It is a good exercise to do for yourself, either way, as performing these exercises for several different "well travelled" packages is a great way to learn how to code in R, as well as tricks of the trade related to programming/computing with data in general.

Enjoy,
-steve

--
Steve Lianoglou
Computational Biologist
Genentech
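As a small illustration of the kind of `approx()` call mentioned above, the following base-R sketch stretches three sorted non-missing values onto a four-point grid by linear interpolation. The values are the non-missing entries of Chip2 from the original question; the variable names (`obs`, `pos_obs`, `pos_full`, `stretched`) are illustrative assumptions and are not taken from limma's source.

```r
# Sorted non-missing values of Chip2 after y[2,2] <- NA (from the original question).
obs <- c(110, 220, 275)

# Positions of the observed values on [0, 1], and the full-length grid (4 rows).
pos_obs  <- seq(0, 1, length.out = length(obs))   # 0, 0.5, 1
pos_full <- seq(0, 1, length.out = 4)             # 0, 1/3, 2/3, 1

# Linear interpolation of the observed quantiles onto the full grid.
stretched <- approx(pos_obs, obs, xout = pos_full)$y
print(round(stretched, 3))
```

The interpolated vector is roughly 110, 183.33, 238.33, 275: the three observed values are spread over four quantile positions, so the column can be averaged with the complete columns.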

written 10.3 years ago by Steve Lianoglou ★ 13k

godahajime ▴ 20

@godahajime-6349

Last seen 9.6 years ago

Dr Lianoglou,

I truly appreciate your kind response. approx() seems to be beyond my capacity at the moment; however, I will challenge myself to work through it.

Professor Smyth, Dr Bolstad,

I have been treating flagged spots as NA. However, supposing that the intensities of flagged spots are reliable to a certain degree, I might leave them intact.

Thanks,

written 10.3 years ago by godahajime ▴ 20

Gordon Smyth 50k

@gordon-smyth

Last seen 6 hours ago

WEHI, Melbourne, Australia

The meaning of quantile normalization with NAs has never been agreed on in a refereed publication, as far as I know. I implemented the limma version long ago, and as far as I know it was the first implementation of quantile normalization to allow NAs. Ben Bolstad implemented a somewhat different algorithm in the affy package. Ben's version is now in the preprocessCore package as normalize.quantiles().

The result you have is correct according to limma's algorithm, which involves interpolating each column of non-missing values out to a full-length vector when computing the mean quantiles. The reason the NA makes a big difference is that it changes the minimum quantile for column 2 from 16.5 to 110, a big change. As an alternative, you might try Ben's algorithm:

library(preprocessCore)
normalize.quantiles(y)

But replacing NAs with row medians would not in general be sufficient.

Best wishes
Gordon

> Date: Tue, 21 Jan 2014 05:03:17 -0800 (PST)
> From: H at mamba.fhcrc.org, "K [guest]" <guest at bioconductor.org>
> To: bioconductor at r-project.org, godahajime at zoho.com
> Subject: [BioC] Question about quantile normalization and NA value
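The interpolation idea described above can be sketched in base R. This is a simplified illustration under stated assumptions, not limma's actual code: the function name `quantile_normalize_na` is hypothetical, it assumes every column has at least two non-missing values, and it resolves ties by average rank.

```r
# Hypothetical helper: a simplified sketch of quantile normalization with NAs
# by interpolation. Not limma's implementation.
quantile_normalize_na <- function(A) {
  n <- nrow(A)
  grid <- (0:(n - 1)) / (n - 1)              # full-length quantile positions

  # Step 1: sort each column's observed values and stretch them to length n.
  S <- apply(A, 2, function(col) {
    obs <- sort(col)                         # sort() drops NAs by default
    k <- length(obs)
    approx((0:(k - 1)) / (k - 1), obs, xout = grid, ties = "ordered")$y
  })

  # Step 2: the reference distribution is the row mean of the stretched columns.
  m <- rowMeans(S)

  # Step 3: map each observed value to the reference via its within-column rank.
  for (j in seq_len(ncol(A))) {
    ok <- !is.na(A[, j])
    k <- sum(ok)
    r <- rank(A[ok, j])
    A[ok, j] <- approx(grid, m, xout = (r - 1) / (k - 1), ties = "ordered")$y
  }
  A                                          # NAs stay NA
}

# The example from the original question.
x <- matrix(c(100, 15, 200, 250, 110, 16.5, 220, 275, 120, 18, 240, 300), 4, 3)
colnames(x) <- paste("Chip", 1:3, sep = "")
rownames(x) <- c("RNA-A", "RNA-B", "RNA-C", "RNA-D")
y <- x
y[2, 2] <- NA

res <- quantile_normalize_na(y)
print(res)
```

For this small example the sketch gives the same values as the `normalizeBetweenArrays(y)` output in the question. The smallest reference quantile becomes (15 + 110 + 18) / 3 = 47.67 instead of 16.5, because 110 is now the minimum of Chip2.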

written 10.3 years ago by Gordon Smyth 50k

At least for the example matrix in the original question, you'll find that the preprocessCore normalize.quantiles() function generates the same result as the limma output shown there. I make no claim, though, that it is identical in other cases, nor that its treatment of NA is better than that of any other implementation.

Best,

Ben

On Jan 22, 2014, at 7:16 PM, Gordon K Smyth <smyth at wehi.edu.au> wrote:
> The result you have is correct according to limma's algorithm, which involves interpolating each column of non-missing values out to a full-length vector when computing the mean quantiles. The reason the NA makes a big difference is that it changes the minimum quantile for column 2 from 16.5 to 110, a big change. As an alternative, you might try Ben's algorithm:
>
> library(preprocessCore)
> normalize.quantiles(y)
>
> But replacing NAs with row medians would not in general be sufficient.

written 10.3 years ago by Ben Bolstad ★ 1.2k