Question

merge SMD data from different print batch

0

Entering edit mode

Adaikalavan Ramasamy ★ 1.8k

@adaikalavan-ramasamy-675

Last seen 10.9 years ago

Dear all, I need some advice on certain potentially ad-hoc measure that I am using to deal with Stanford Microarray Database (SMD) data. Problem : A single study in SMD often utilises different print batches - sometimes with twice as many unique features in one batch than other batches. This is compounded by the fact there are multiple copies for some genes. Current solution : I am aware that marray has read.SMD(), limma has read.maimagenes( ..., source="smd"), merge.MAList() and someone else proposed read.SMD2() on the mailing list. I could be wrong here I believe these assume that the arrays contain the same probeset but potentially disordered. Potential solution : My steps are as following. Scripts to do this are given at the end. 1) First we preprocess each array with LOESS followed by scale normalisation before combining with other arrays. This is done is the first half of the function my.SMD.expr(). 2) Next, we average the log ratios over the LUID or SUID (for old SMD dataset) and removing redundant gene annotations. This is done in get.SMD.expr(). This is potentially the a contentious issue. 3) Finally we merge the different arrays by using the LUID and the average gene expression for that LUID that was calculated in step 2. Is this complete nonsense ? Can someone please shed some light. Unfortunately, I did not expect SMD datasets to be this complicated. Regards, Adai ---------------------------------------------------------------------- -------- get.SMD.expr <- function(input, by, summary.fn=mean){ tmp <- lapply( input, function(m) m$genes ) genes <- do.call("rbind", tmp); good <- !duplicated( genes[ , by] ) genes <- genes[ good, c(by, "CLONE ID", "GENE SYMBOL", "CLUSTER ID", "ACCESSION", "GENE NAME")] rownames(genes) <- genes[ , by] cn <- sapply( input, function(m) m$target$FileName ) M <- matrix( nr=nrow(genes), nc=length(input), dimnames=list( rownames(genes), cn ) ) for(i in 1:length(input)){ key <- as.character( input[[i]]$genes[ , by] ) val <- as.numeric( input[[i]]$M ) out <- tapply( val, key, summary.fn, na.rm=TRUE ) M[ names(out), i ] <- out cat(i, "\t") } cat("\n") return( list(M=M, genes=genes) ) } my.SMD.expr <- function( files, source=c("smd", "smd.old") ){ require(limma) raw <- list(NULL) for(i in 1:length(files)){ tmp1 <- read.maimages( files=files[i], source=source ) tmp2 <- normalizeWithinArrays( tmp1, method="loess" ) tmp3 <- normalizeBetweenArrays( tmp2, method="scale" ) colnames(tmp3$genes) <- toupper( colnames(tmp3$genes) ) raw[[i]] <- tmp3 rm(tmp1, tmp2, tmp3) } names(raw) <- files if(source=="smd") by <- "LUID" if(source=="smd.old") by <- "SUID" z <- get.SMD.expr( raw, by=by ) }

Microarray limma marray Microarray limma marray • 1.2k views

ADD COMMENT • link updated 19.3 years ago by Jeremy Gollub ▴ 60 • written 19.3 years ago by Adaikalavan Ramasamy ★ 1.8k

score 0 · Answer 1 · 2006-04-02

Adaikalavan Ramasamy wrote: > > 1) First we preprocess each array with LOESS followed by scale > normalisation before combining with other arrays. This is done is the > first half of the function my.SMD.expr(). Generally reasonable for an expression experiment in SMD, I think, although all of the usual concerns about normalization apply. There are some very small arrays which might benefit from different treatment, but at a guess you're probably dealing with human cancer data...? > 2) Next, we average the log ratios over the LUID or SUID (for old SMD > dataset) and removing redundant gene annotations. This is done in > get.SMD.expr(). This is potentially the a contentious issue. This is generally the way it's done by SMD users. I think some of the regulars on this list prefer different approaches. I'm not certain why you'd use LUID for newer data? SUID = Sequence Unique ID. An SUID should (and almost always does) refer to a single "reporter" DNA sequence (oligo, PCR product, or cDNA clone spotted on an array). The same cDNA, on multiple arrays, should always have the same SUID within SMD, regardless of the fabrication history. Identification is by public accession (e.g., GenBank for human ESTs) whenever possible. LUID = Laboratory Unique ID. This refers to the product of a distinct preparation, e.g. a single microtiter plate well for cDNA clones, or a single synthesis of a particular oligo. When it's available at all, LUID should be entirely nested within SUID; i.e., a single oligo in SMD should have a single SUID, but might have several LUIDs if it was synthesized multiple times or in different places. LUID is generally used to track fabrication process quality issues, such as discovery and isolation of a contaminated or mislabeled set of spots. If you're not concerned about or interested in QA issues, you probably don't want to work in LUID space. > 3) Finally we merge the different arrays by using the LUID and the > average gene expression for that LUID that was calculated in step 2. Subject to the note above about the interpretation of LUIDs, this is standard practice in SMD and probably pretty reasonable. - Jeremy (former SMD developer) -- Jeremy Gollub jeremy at gollub.net