Peculiar behaviour of normalize.quantiles (in affy, preprocessCore) if there are NA data
@wolfgang-huber-3550
Hi all,

I noticed a peculiar result when using quantile normalisation on a data matrix that contains NA values. It produces a rather artifactual-looking distribution of the resulting data, and I wonder:
- whether this is desired,
- if not, how it can be fixed,
- in either case, whether this is a point of general interest for people who interpret the distributions of their data, e.g. microarray data.

Here is some example code to reproduce:

    library("geneplotter")
    library("preprocessCore")

    set.seed(0xbeef)

    x = matrix(as.numeric(NA), nrow=20000, ncol=2)
    for(i in 1:ncol(x))
      x[,i] = c(rnorm(10000), runif(10000)*10)
    x[ sample(nrow(x), 1000), ncol(x)] = NA

    qx = normalize.quantiles(x)

    par(mfrow=c(2,2))

    for(what in c("x", "qx"))
      for(i in 1:2)
        hist(get(what)[,i], breaks=seq(-5,10, length=75),
             main=sprintf("%s[,%d]", what, i),
             col="orange", xlab="")

The resulting plot is here:
http://www.ebi.ac.uk/~huber/quantilenormalisation/normalize.quantiles.png

I noticed that the implementation in preprocessCore/src/qnorm.c makes no special provision for NA values. Could this be confusing the algorithm?

    R version 2.6.0 Under development (unstable) (2007-07-10 r42165)
    x86_64-unknown-linux-gnu

    locale:
    LC_CTYPE=en_GB.UTF-8;LC_NUMERIC=C;LC_TIME=en_GB.UTF-8;LC_COLLATE=en_GB.UTF-8;LC_MONETARY=en_GB.UTF-8;LC_MESSAGES=en_GB.UTF-8;LC_PAPER=en_GB.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_GB.UTF-8;LC_IDENTIFICATION=C

    attached base packages:
    [1] tools stats graphics grDevices datasets utils methods
    [8] base

    other attached packages:
    [1] preprocessCore_0.99.8 geneplotter_1.15.1 lattice_0.16-1
    [4] annotate_1.15.2 AnnotationDbi_0.0.78 RSQLite_0.5-4
    [7] DBI_0.2-3 Biobase_1.15.17 fortunes_1.3-3

    loaded via a namespace (and not attached):
    [1] grid_2.6.0 KernSmooth_2.22-20 RColorBrewer_0.2-3

Best wishes
Wolfgang

------------------------------------------------------------------
Wolfgang Huber EBI/EMBL Cambridge UK http://www.ebi.ac.uk/huber
Ben Bolstad ★ 1.2k
@ben-bolstad-1494
Wolfgang,

The quantile normalization code in preprocessCore shows its legacy: it was developed around probe-level Affymetrix data taken straight from CEL files, where NA values are not expected. There may or may not be comments to that effect in the C code documentation (actually there are, further down in the qnorm.c file, for a slight variation on the implementation).

If you are willing to assume that the missing-data mechanism is "missing at random", then I think the fix is fairly trivial: just estimate the distribution using the non-missing data. If the missingness is instead driven by, say, a truncation mechanism, a different fix would be needed.

In either case, I don't think the current behaviour is desirable, and it should be fixed.

Best,

Ben

On Tue, 2007-07-10 at 18:35 +0100, Wolfgang Huber wrote:
> Hi all,
>
> I noted a peculiar result from using quantile normalisation on a data
> matrix that contained NA values. [...]
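The fix suggested in the reply (estimate the target distribution from the non-missing data, assuming the values are missing at random) could be sketched in R roughly as follows. This is only an illustration under that assumption: `normalize_quantiles_na` is a hypothetical helper, not a function in preprocessCore or affy, and the choice of quantile `type = 5` and of linear interpolation are assumptions of this sketch rather than anything stated in the thread.

```r
# Hypothetical helper (not part of preprocessCore): quantile normalisation
# that ignores NA values, assuming they are missing at random.
normalize_quantiles_na <- function(x) {
  stopifnot(is.matrix(x), is.numeric(x), nrow(x) >= 2)
  n <- nrow(x)
  probs <- (seq_len(n) - 0.5) / n   # common probability grid for all columns

  # Step 1: empirical quantiles of each column, using observed values only,
  # evaluated on the common grid so that columns with different numbers of
  # NAs remain comparable. With type = 5, a complete column gives back its
  # sorted values on this grid.
  col_quantiles <- apply(x, 2, function(col) {
    quantile(col, probs = probs, na.rm = TRUE, names = FALSE, type = 5)
  })

  # Step 2: the target distribution is the mean of the column quantiles.
  target <- rowMeans(col_quantiles, na.rm = TRUE)

  # Step 3: replace each observed value by the target quantile at its
  # within-column rank; NA entries are left as NA.
  out <- x
  for (j in seq_len(ncol(x))) {
    obs <- !is.na(x[, j])
    m <- sum(obs)
    if (m == 0) next
    r <- rank(x[obs, j], ties.method = "average")
    p <- (r - 0.5) / m
    out[obs, j] <- approx(probs, target, xout = p, rule = 2)$y
  }
  out
}
```

With the matrix `x` from the original post, something like `qx2 = normalize_quantiles_na(x)` could be plotted with the same `hist` loop for comparison with `normalize.quantiles(x)`. If the missingness comes from truncation rather than being random, this sketch would not be appropriate, as the reply points out.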