Entering edit mode
Dear Diego,
We are interested in introducing Codelink support into limma at some
stage,
so we'd be interested to see your class definition and parser at some
time.
The main reason we don't support Codelink yet is some uncertainty
about how
best to handle the multiple fields (sub-arrays) which are part of the
Codelink platform.
I think that you would almost certainly be better off avoiding missing
values in the first place rather than trying to accommodate them
during the
normalization process. See for example:
https://stat.ethz.ch/pipermail/bioconductor/2005-April/008348.html
Note that while normalizeQuantiles() does allow for missing values,
the
missing values are assumed to have occured at random. If the missing
values
are actually associated with low intensities, this is not missing at
random
and the normalization process will be biased. Even if you use a method
other than quantile normalization, the whole process of between-array
normalization is problematic with missing values. Better to avoid
them.
Gordon
>Date: Fri, 22 Apr 2005 14:45:53 +0200
>From: Diego D?ez Ruiz <ddiez@iib.uam.es>
>Subject: [BioC] Integrating Codelink data with bioconductor (using
> affy and limmafunctions)
>To: bioconductor@stat.math.ethz.ch
>
>Dear BioC users and developers:
>
>I'm working with Codelink plattform microarray data and i want to
apply
>some of the knowledge currently available for other platforms as
>Affymetrix o cDNA spotted microarrays to deal with this data. I have
>been using Affymetrix data for a while and i used suscessfully the
>great affy package. I also worked with spoted cDNA data and use the
>also great package limma mainly. Currently this is what I have done:
>
>1. Write a parser for exported text data from the codelink software.
>
>2. I found convenient to create a new class for storing data from
this
>platform because there are no unique identifiers as in Affymetrix and
so
>an exprSet object was not appropiate. I think an RGlist/MAlist-like
>definition could do the job fine so I almost cut and paste the
classes
>definition in the limma package an modified it a little (any concerns
>about using other's code is wellcome. I'm willing to release this
code
>under LGPL licence) to make a currently named "Codelink" class. The
>class currently stores signal values, flag values, type of spot and
>codelink identifiers. Also I make some annotation packages for the
>human10k, humanwholegenome and ratwholegenome bioarrays but i have to
>test it deeper. A remember and older thread in the BioC list about
>modifiying the exprSet definition to deal with this problem. Is there
>any plan to change it so I can use a more standard approach better
that
>create a new class?
>
>3. Created/adapted some functions for plotting: Density plots,
MAplot,
>Scatter plot that let see the position of diferent type of spots in
>codelink chips: that is FIDUCIAL, DISCOVERY, POSITIVE, etc.
>
>
>All that worked more or less. Data can be imported as RAW data
(default)
>or normalized (Codelink median normalization) and as log2 values
>(default) or not. When I calculate log2 values, i initially decided
to
>set negative values (obtained from subtracting background to signal
>sometimes) to somethign like 0.01. This works fine with
>normalize.quantiles and normalize.loess from affy package but create
>artifact point in a MAplot and a low intensity peak in a density
plot.
>Then, I saw a post from Gordon Smyth (sorry dont have link) telling
that
>limma make negative values NA prior to log2. I wanted to do that and
>modified the functions involved. Finaly this is when I found a mayor
>problem:
>
>1) normalize.loess: I got to modify this function to allow NA values.
>The function compares each array whith the others so if you have 10
>arrays then each array is compared and modified (after loess
>estimation) 10 times. If in each iteration I select the non NA values
>created when you calculate A or M then in each iteration you have a
>different subset of genes used in the normalization process. As a
little
>example, if you have 4 arrays with 5 genes:
>
> chip1 chip2 chip3 chip4
>gen1 1 10 NA 5
>gen2 3 NA 40 6
>gen3 4 NA NA 7
>gen4 3 12 45 6
>gen5 2 1 4 3
>
>normalize.loess take a matrix as argument and do all pairwise
comparisons:
>
>chip1-chip2: used gen1,gen4,gen5 to estimate loess curve.
>chip1-chip3: used gen2,gen4,gen5 to estimate loess curve.
>chip1-chip4: used all genes to estimate loess curve.
>chip2-chip3: used gen4,gen5 to estimate loess curve.
>chip2-chip4: used gen1,gen4,gen5 to estimate loess curve.
>chip3-chip4: used gen2,gen4,gen5 to estimate loess curve.
>
>The code modified/added to allow for NA is between #:
>(from normalize.loess in affy 1.5.8)
>
>---------------------------------------------------------------------
---
>
>normalize.loess <- function (mat, subset = sample(1:(dim(mat)[1]),
>min(c(5000, nrow(mat)))),
> epsilon = 10^-2, maxit = 1, log.it = TRUE, verbose = TRUE,
> span = 2/3, family.loess = "symmetric")
>{
> J <- dim(mat)[2]
> II <- dim(mat)[1]
> newData <- mat
> if log.it) {
> mat <- log2(mat)
> newData <- log2(newData)
> }
> change <- epsilon + 1
> fs <- matrix(0, II, J)
> iter <- 0
> w <- c(0, rep(1, length(subset)), 0)
> while (iter < maxit) {
> iter <- iter + 1
> means <- matrix(0, II, J)
> for (j in 1:(J - 1)) {
> for (k in (j + 1):J) {
> y <- newData[, j] - newData[, k]
> x <- (newData[, j] + newData[, k])/2
> #####################################
> # Select genes that are not set to NA
> sel <- which(!is.na(as.character(y)))
> y <- y[sel]
> x <- x[sel]
> #####################################
> index <- c(order(x)[1], subset, order(-x)[1])
> xx <- x[index]
> yy <- y[index]
> aux <- loess(yy ~ xx, span = span, degree = 1,
> weights = w, family = family.loess)
> aux <- predict(aux, data.frame(xx = x))/J
> #####################################
> # Apply correction to genes not NA.
> means[sel, j] <- means[sel, j] + aux
> means[sel, k] <- means[sel, k] - aux
> #####################################
> if (verbose)
> cat("Done with", j, "vs", k, " in iteration ",
> iter, "\n")
> }
> }
> fs <- fs + means
> newData <- mat - fs
> change <- max(colMeans((means[subset, ])^2))
> if (verbose)
> cat(iter, change, "\n")
> oldfs <- fs
> }
> if ((change > epsilon) & (maxit > 1))
> warning(paste("No convergence after", maxit,
"iterations.\n"))
> if log.it) {
> return(2^newData)
> }
> else return(newData)
>}
>
>---------------------------------------------------------------------
---
>
>I'm not an statitician so not sure if this could be a correct
approach:
>Is this correct or I cannot do that?
>
>2) With normalize.quantiles all seems to work fine... the function
don't
>say nothing about NA values and the density plots seems correct
(almost
>same density plot for all chips) but the MAplot have changed greatly
>with a great number of genes with greater values of M (in absolute
>value) that is, a wider cloud of points. I don't know why is this
>happening. I also tried to see at the C code for normalize.quantiles
(at
>qnorm.c) but not found an easy answer. Hopefully I finally found that
I
>can use normalizeQuantiles from limma package that allows for NA
values.
>
>By the way, is this really necesary or can I assign a low value to
>negative ones?. I think it is better to use NA but...
>Something I'm going to do sonner is to import all the values obtained
>from the Codelink data (mean foreground and background, median
>foreground and background, etc) so it could be possible to calculate
a
>signal value within R and then use a method that avoids negative
values.
>But that could be the same in some cases to assign a small value (Or
I'm
>wrong)
>
>I'm working with a linux box (Debian Sarge) with R 2.0.1 compiled
from
>sources and BioC 1.5.
>
>Any comment on all that would be very appreciated.
>
>Thanks in advanced!
>
>
>D.