Peculiar behaviour of normalize.quantiles (in affy, preprocessCore) if there are NA data

Wolfgang Huber ★ 13k

EMBL European Molecular Biology Laborat…

Hi all,

I noticed a peculiar result from using quantile normalisation on a data matrix that contained NA values. It creates a rather artifactual-looking distribution of the resulting data, and I wonder whether:

- this is desired,
- if not, how it can be fixed,
- in either case, whether this is a point of general interest for people who interpret distributions of their data, e.g. microarray data.

Here is some example code to reproduce:

    library("geneplotter")
    library("preprocessCore")

    set.seed(0xbeef)

    x = matrix(as.numeric(NA), nrow=20000, ncol=2)
    for(i in 1:ncol(x))
      x[,i] = c(rnorm(10000), runif(10000)*10)
    x[ sample(nrow(x), 1000), ncol(x)] = NA

    qx = normalize.quantiles(x)

    par(mfrow=c(2,2))

    for(what in c("x", "qx"))
      for(i in 1:2)
        hist(get(what)[,i], breaks=seq(-5,10, length=75),
             main=sprintf("%s[,%d]", what, i),
             col="orange", xlab="")

The resulting plot is here:
http://www.ebi.ac.uk/~huber/quantilenormalisation/normalize.quantiles.png

I noticed in the implementation in preprocessCore/src/qnorm.c that no special consideration is made for NA values. Could this be confusing the algorithm?

    R version 2.6.0 Under development (unstable) (2007-07-10 r42165)
    x86_64-unknown-linux-gnu

    locale:
    LC_CTYPE=en_GB.UTF-8;LC_NUMERIC=C;LC_TIME=en_GB.UTF-8;LC_COLLATE=en_GB.UTF-8;LC_MONETARY=en_GB.UTF-8;LC_MESSAGES=en_GB.UTF-8;LC_PAPER=en_GB.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_GB.UTF-8;LC_IDENTIFICATION=C

    attached base packages:
    [1] tools stats graphics grDevices datasets utils methods
    [8] base

    other attached packages:
    [1] preprocessCore_0.99.8 geneplotter_1.15.1 lattice_0.16-1
    [4] annotate_1.15.2 AnnotationDbi_0.0.78 RSQLite_0.5-4
    [7] DBI_0.2-3 Biobase_1.15.17 fortunes_1.3-3

    loaded via a namespace (and not attached):
    [1] grid_2.6.0 KernSmooth_2.22-20 RColorBrewer_0.2-3

Best wishes
Wolfgang

------------------------------------------------------------------
Wolfgang Huber EBI/EMBL Cambridge UK http://www.ebi.ac.uk/huber
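One way to put a number on the distortion seen in the histograms is to compare the quantiles of the columns, computed from the non-missing values only. The sketch below uses base R only; `quantile_spread` is a hypothetical helper written for illustration and is not part of preprocessCore or affy. If a quantile normalisation has treated the NA values sensibly, one would expect the returned value to be close to zero.

```r
# Hypothetical helper (not part of preprocessCore or affy).
# For each probability in `probs`, compute the quantile of every column
# from its non-missing values, then return the largest gap between columns.
quantile_spread <- function(m, probs = seq(0.05, 0.95, by = 0.05)) {
  q <- apply(m, 2, quantile, probs = probs, na.rm = TRUE, names = FALSE)
  # Force a (probabilities x columns) matrix, also when length(probs) == 1.
  q <- matrix(q, nrow = length(probs))
  max(apply(q, 1, function(r) diff(range(r))))
}
```

With the matrices from the example above, comparing `quantile_spread(x)` with `quantile_spread(qx)` should indicate whether the normalised columns really share a distribution.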

written 16.8 years ago by Wolfgang Huber • updated 16.8 years ago by Ben Bolstad

Ben Bolstad ★ 1.2k

Wolfgang,

The quantile normalization code in preprocessCore shows its legacy: it was developed around probe-level Affymetrix data read straight from CEL files, where NA values are not to be expected. There may or may not be comments to that effect in the C code documentation (there actually is one further down in the qnorm.c file, for a slight variation on the implementation).

If you are willing to assume that the missing-data mechanism is "missing at random", then I think the fix is fairly trivial: just estimate the distribution using the non-missing data. If the missingness is instead driven by, say, a truncation mechanism, a different fix would be needed.

In either case, I don't think the current behaviour is desirable, and it should be fixed.

Best,
Ben

On Tue, 2007-07-10 at 18:35 +0100, Wolfgang Huber wrote:
> Hi all,
>
> I noted a peculiar result from using quantile normalisation on a data
> matrix that contained NA values.
> [...]
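The suggestion of estimating the distribution from the non-missing data could be sketched in base R roughly as follows. This is only an illustration under the "missing at random" assumption, not the preprocessCore implementation; `normalize_quantiles_na` is a hypothetical name, and the choice of interpolating each column's observed values onto a common grid of `nrow(m)` probabilities is an assumption of the sketch. For a complete matrix without ties it reduces to ordinary quantile normalisation (the target is the row-wise mean of the sorted columns).

```r
# Hypothetical sketch, NOT the preprocessCore code. Assumes values are
# missing at random and every column has at least two observed values.
normalize_quantiles_na <- function(m) {
  n <- nrow(m)
  stopifnot(n >= 2, all(colSums(!is.na(m)) >= 2))
  probs <- (seq_len(n) - 1) / (n - 1)          # common probability grid
  # Quantiles of each column, estimated from its non-missing values only.
  col_q <- apply(m, 2, quantile, probs = probs, na.rm = TRUE, names = FALSE)
  target <- rowMeans(col_q)                    # target distribution
  out <- m
  for (j in seq_len(ncol(m))) {
    ok <- !is.na(m[, j])
    nj <- sum(ok)
    # Position of each observed value within its own column, on [0, 1].
    p <- (rank(m[ok, j], ties.method = "average") - 1) / (nj - 1)
    # Look up the target distribution at that position; NA entries stay NA.
    out[ok, j] <- approx(probs, target, xout = p)$y
  }
  out
}
```

Applied to the example matrix as `normalize_quantiles_na(x)`, the two output columns should have matching histograms if the missing-at-random assumption holds; a truncation mechanism would need a different treatment, as noted above.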

written 16.8 years ago by Ben Bolstad