merge SMD data from different print batch
1
0
Entering edit mode
@adaikalavan-ramasamy-675
Last seen 9.7 years ago
Dear all, I need some advice on certain potentially ad-hoc measure that I am using to deal with Stanford Microarray Database (SMD) data. Problem : A single study in SMD often utilises different print batches - sometimes with twice as many unique features in one batch than other batches. This is compounded by the fact there are multiple copies for some genes. Current solution : I am aware that marray has read.SMD(), limma has read.maimagenes( ..., source="smd"), merge.MAList() and someone else proposed read.SMD2() on the mailing list. I could be wrong here I believe these assume that the arrays contain the same probeset but potentially disordered. Potential solution : My steps are as following. Scripts to do this are given at the end. 1) First we preprocess each array with LOESS followed by scale normalisation before combining with other arrays. This is done is the first half of the function my.SMD.expr(). 2) Next, we average the log ratios over the LUID or SUID (for old SMD dataset) and removing redundant gene annotations. This is done in get.SMD.expr(). This is potentially the a contentious issue. 3) Finally we merge the different arrays by using the LUID and the average gene expression for that LUID that was calculated in step 2. Is this complete nonsense ? Can someone please shed some light. Unfortunately, I did not expect SMD datasets to be this complicated. Regards, Adai ---------------------------------------------------------------------- -------- get.SMD.expr <- function(input, by, summary.fn=mean){ tmp <- lapply( input, function(m) m$genes ) genes <- do.call("rbind", tmp); good <- !duplicated( genes[ , by] ) genes <- genes[ good, c(by, "CLONE ID", "GENE SYMBOL", "CLUSTER ID", "ACCESSION", "GENE NAME")] rownames(genes) <- genes[ , by] cn <- sapply( input, function(m) m$target$FileName ) M <- matrix( nr=nrow(genes), nc=length(input), dimnames=list( rownames(genes), cn ) ) for(i in 1:length(input)){ key <- as.character( input[[i]]$genes[ , by] ) val <- as.numeric( input[[i]]$M ) out <- tapply( val, key, summary.fn, na.rm=TRUE ) M[ names(out), i ] <- out cat(i, "\t") } cat("\n") return( list(M=M, genes=genes) ) } my.SMD.expr <- function( files, source=c("smd", "smd.old") ){ require(limma) raw <- list(NULL) for(i in 1:length(files)){ tmp1 <- read.maimages( files=files[i], source=source ) tmp2 <- normalizeWithinArrays( tmp1, method="loess" ) tmp3 <- normalizeBetweenArrays( tmp2, method="scale" ) colnames(tmp3$genes) <- toupper( colnames(tmp3$genes) ) raw[[i]] <- tmp3 rm(tmp1, tmp2, tmp3) } names(raw) <- files if(source=="smd") by <- "LUID" if(source=="smd.old") by <- "SUID" z <- get.SMD.expr( raw, by=by ) }
Microarray limma marray Microarray limma marray • 1.0k views
ADD COMMENT
0
Entering edit mode
@jeremy-gollub-1462
Last seen 9.7 years ago
Adaikalavan Ramasamy wrote: > > 1) First we preprocess each array with LOESS followed by scale > normalisation before combining with other arrays. This is done is the > first half of the function my.SMD.expr(). Generally reasonable for an expression experiment in SMD, I think, although all of the usual concerns about normalization apply. There are some very small arrays which might benefit from different treatment, but at a guess you're probably dealing with human cancer data...? > 2) Next, we average the log ratios over the LUID or SUID (for old SMD > dataset) and removing redundant gene annotations. This is done in > get.SMD.expr(). This is potentially the a contentious issue. This is generally the way it's done by SMD users. I think some of the regulars on this list prefer different approaches. I'm not certain why you'd use LUID for newer data? SUID = Sequence Unique ID. An SUID should (and almost always does) refer to a single "reporter" DNA sequence (oligo, PCR product, or cDNA clone spotted on an array). The same cDNA, on multiple arrays, should always have the same SUID within SMD, regardless of the fabrication history. Identification is by public accession (e.g., GenBank for human ESTs) whenever possible. LUID = Laboratory Unique ID. This refers to the product of a distinct preparation, e.g. a single microtiter plate well for cDNA clones, or a single synthesis of a particular oligo. When it's available at all, LUID should be entirely nested within SUID; i.e., a single oligo in SMD should have a single SUID, but might have several LUIDs if it was synthesized multiple times or in different places. LUID is generally used to track fabrication process quality issues, such as discovery and isolation of a contaminated or mislabeled set of spots. If you're not concerned about or interested in QA issues, you probably don't want to work in LUID space. > 3) Finally we merge the different arrays by using the LUID and the > average gene expression for that LUID that was calculated in step 2. Subject to the note above about the interpretation of LUIDs, this is standard practice in SMD and probably pretty reasonable. - Jeremy (former SMD developer) -- Jeremy Gollub jeremy at gollub.net
ADD COMMENT
0
Entering edit mode
Dear Dr. Gollub, Sorry for the delay in responding as I was debugging my functions. Thank you for your detailed explanation which made many things clearer. And thank you for developing SMD which is an earlier pioneers of open repositories and set the standards for many later repositories. The reason why I used LUID is that read.maimages( source="smd" ) function in LIMMA stores Luid information instead of SUID by default. I will email Dr. Symth to inform him of this. Please ignore the codes in my previous email as the order of merge and normalizeBetween arrays was incorrect. The codes below are correct and slightly optimised for memory consumption. ---------------------------------------------------------------------- - normalize.SMD <- function(files, source=c("smd", "smd.old"), annotation=c("SUID", "Clone ID", "Gene Symbol", "Cluster ID", "Accession")){ ## 1. Uses LOESS normalization within arrays ## 2. Merges different print batches into a coherent set ## 3. Uses scale normalization between arrays for the set as above require(limma) for(i in 1:length(files)){ cat(i, "\t") fn <- files[i] tmp1 <- read.maimages( files=fn, source=source, annotation=annotation ) tmp2 <- normalizeWithinArrays( tmp1, method="loess" ) ## Average M and A with the same SUID suid <- as.character(tmp2$genes[ , "SUID"]) Mval <- tapply( as.numeric(tmp2$M), suid, mean, na.rm=TRUE ) Aval <- tapply( as.numeric(tmp2$A), suid, mean, na.rm=TRUE ) stopifnot( identical( names(Mval), names(Aval) ) ) df <- data.frame( names(Mval), Mval, Aval ) colnames(df) <- c("SUID", paste(fn, "M", sep="_"), paste(fn, "A", sep="_")) if(i==1){ expr <- df ## initialize the expression matrix ann <- unique(tmp2$genes) ## initialize database of annotations } else { expr <- merge(expr, df, by="SUID", all=TRUE) ann <- unique( rbind(ann, tmp2$genes) ) } rm(fn, tmp1, tmp2, suid, Mval, Aval, df); gc() } expr <- expr[ order(as.character(expr$SUID)), ] rownames(expr) <- as.character(expr$SUID); expr$SUID <- NULL ann <- ann [ order(as.character(ann$SUID)), ] rownames(ann) <- as.character(ann$SUID); ann$SUID <- NULL stopifnot( identical( rownames(expr), rownames(ann) ) ) Mmat <- as.matrix( expr[ , grep( "_M$", colnames(expr) ) ] ) Amat <- as.matrix( expr[ , grep( "_A$", colnames(expr) ) ] ) obj <- new( "MAList", list(M=Mmat, A=Amat, genes=ann) ) out <- normalizeBetweenArrays( obj, method="scale" ) return(out) } ## sample usage to normalize inputs <- list.files(pattern="xls$", full.names=TRUE) output <- normalize.SMD( inputs, source="smd" ) ## might also want remove probes with less than 75% presence good <- which( rowMeans( !is.na( output ) ) > 0.75 ) output <- output[ good, ] ---------------------------------------------------------------------- - Regards, Adai
ADD REPLY

Login before adding your answer.

Traffic: 661 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6