Biobase ExpressionSet: metadata on assayData

Eric Lecoutre ▴ 40

@eric-lecoutre-2540

An embedded and charset-unspecified text was scrubbed... Name: not available Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20071214/c9a4fe4e/attachment.pl

Martin Morgan 25k

@martin-morgan-1513

Hi Eric --

* ExpressionSet

ExpressionSet itself is meant for gene expression data. The 'assay' data is essentially a matrix of 'features' (genes / probes) x 'phenotypes' (samples). The assay data is annotated on both features and phenotypes.

The phenotypes are annotated with the AnnotatedDataFrame in the phenoData slot. This would typically include all the information about experimental design relevant to the samples.

The features _can_ be annotated with the AnnotatedDataFrame in the featureData slot. However, for expression data, features and their annotations are usually common across chips. For this reason the annotations are usually stored independently of the assay data, in the so-called 'annotation' packages named after the chip and referenced by the 'annotation' slot in the expression set.

Finally, information about the overall experiment summarized in the assay data can be stored in the container in the experimentData slot.

* Actually used?

A typical single-channel microarray work flow starts with ReadAffy followed by pre-processing. The output is an ExpressionSet. The main downstream analytic pathways either expect or work with ExpressionSet. Many users probably rely implicitly or explicitly on ExpressionSet, and there are dozens of data sets from actual analyses on the Bioconductor web site. So yes, they're actually used.

It is not hard to create a rudimentary expression set starting from scratch,

> library(Biobase)
> m <- matrix(runif(100000), ncol=10)
> e <- new("ExpressionSet", exprs=m)

Of course there is no metadata, but that can be added either at construction or subsequently (as described in one of the Biobase vignettes, An Introduction to Biobase and Expression Sets).

* Data other than microarrays

ExpressionSet is meant for summarized gene expression data. ExpressionSet is derived from an underlying class eSet. Projects interested in other types of data have used eSet (and AnnotatedDataFrame) as a basis for packaging other data types (e.g., the flowCore projects looking at flow cytometry). This is great, because the adoption of common data structures can greatly facilitate interoperability.

Hope that helps,

Martin

"Eric Lecoutre" <ericlecoutre at gmail.com> writes:

> Hi,
>
> I am new to Bioconductor and am studying both Biobase and biostatistics for
> a small project.
> My client wants to know whether he should use ExpressionSet for part of his
> assay R&D process.
> For an experiment, I understand there is a lot of common metadata like
> compound, dose level, replicate,...
> I have seen the pheno and feature data frame class AnnotatedDataFrame and
> already told the client he could use that.
> The fact is that those metadata (if I have understood well) could also be
> used for gene expression (so assayData).
> What is the standard Bioconductor way to handle those metadata? There is
> no metadata argument associated with assayData.
> Should I use an AnnotatedDataFrame for features, repeating gene expression
> with such metadata?
>
> btw, are there people here who really use ExpressionSet in their processes?
>
> Thanks for any insight.
>
> Eric
>
> PS: as I looked at the AnnotatedDataFrame class, I missed a helper function
> to exploit metadata.
> Here is such a little function and a sample use, where one requests
> variables in an AnnotatedDataFrame with conditions on metadata (arbitrary
> ones, handled by dots ...)
>
> selectVariables <- function(x,logic=all,drop=FALSE,...){
>   listCriteria <- list(...)
>   metadata <- varMetadata(x)
>   retainedCriteria <- list()
>   sapply(names(listCriteria), function(critname) {
>     if(!critname %in% colnames(metadata)){
>       cat("\n Dropped criteria:",critname, "not in AnnotatedDataFrame\n")
>     }else{
>       # [[ ]] extracts the criterion itself; [ ] would return a one-element list
>       if(is.null(listCriteria[[critname]])) listCriteria[[critname]] <- unique(metadata[,critname])
>       retainedCriteria[[critname]] <<- metadata[,critname] %in% listCriteria[[critname]]
>     }
>   })
>   criteriaValues <- do.call("cbind",retainedCriteria)
>   # local assignment; the original used <<- and leaked this into the global environment
>   selectedColumns <- apply(criteriaValues,1,logic)
>   cat('\n',sum(selectedColumns),' columns selected.\n',sep='')
>   return(selectedColumns)
> }
>
> library(Biobase)
> # preparing metadata
> treatment=c("D","192","233","192","233")
> control=c(1,0,0,0,0)
> dose=c(NA,30,10,10,0.3)
> replicate=rep(1,5)
> metadata <- data.frame(cbind(treatment=treatment,control=control,dose=dose,replicate=replicate,
>   labelDescription=paste("treatment: ",treatment, ifelse(control==1, " [control]",""),
>     " dose:",dose,"(",replicate,")",sep='')))
>
> data1=data.frame(cbind(v1=1:2,v2=2:3,v3=3:4,v4=4:5,v5=5:6))
> anData1 = new("AnnotatedDataFrame",data=data1,varMetadata=metadata)
>
> # use little function to create a subset data.frame
>
> anData1[,selectVariables(anData1,dose=10, dummy=0)]
>
> --
> Eric Lecoutre
> Consultant - Business & Decision
> Business Intelligence & Customer Intelligence
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M2 B169
Phone: (206) 667-2793

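The reply above notes that metadata can be added "either at construction or subsequently". A minimal sketch of both routes follows. The matrix values, the sample names S1..S4, the compound and dose columns, and the experiment title are invented for illustration and are not from the thread; only the Biobase classes and accessors are real.

```r
library(Biobase)

set.seed(1)
# Hypothetical assay matrix: 100 features (rows) x 4 samples (columns)
m <- matrix(runif(400), ncol = 4,
            dimnames = list(paste0("probe", 1:100), paste0("S", 1:4)))

# Sample-level data: one row per sample, row names matching colnames(m)
pd <- data.frame(compound = c("ctrl", "A", "A", "B"),
                 dose = c(0, 10, 30, 10),
                 row.names = colnames(m))

# One row of metadata per column of pd
vm <- data.frame(labelDescription = c("Compound applied", "Dose level"),
                 row.names = names(pd))

adf <- new("AnnotatedDataFrame", data = pd, varMetadata = vm)

# Route 1: metadata supplied at construction
e <- new("ExpressionSet", exprs = m, phenoData = adf,
         experimentData = new("MIAME", title = "Hypothetical dose study"))

# Route 2: a further sample-level variable added afterwards
e$replicate <- rep(1L, ncol(e))

varLabels(e)
```

The design point is that per-sample information such as compound, dose and replicate lives in phenoData, so it travels with the assay matrix whenever the ExpressionSet is subset.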

Eric Lecoutre ▴ 40

@eric-lecoutre-2540

An embedded and charset-unspecified text was scrubbed... Name: not available Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20071219/9342afba/attachment.pl

Martin Morgan 25k

@martin-morgan-1513

Hi Eric --

Glad to be of help. I did not see your 'ps' in the original message; perhaps the following has been clarified, but in case not...

Are you sure you are using AnnotatedDataFrame as it is intended? As a concrete example:

> library(Biobase)
> data(sample.ExpressionSet)
> dim(sample.ExpressionSet)
Features  Samples
     500       26
> phenoData(sample.ExpressionSet)
An object of class "AnnotatedDataFrame"
  sampleNames: A, B, ..., Z  (26 total)
  varLabels and varMetadata description:
    sex: Female/Male
    type: Case/Control
    score: Testing Score
> pData(sample.ExpressionSet)
     sex    type score
A Female Control  0.75
B   Male    Case  0.40
C   Male Control  0.73
D   Male    Case  0.42
E Female    Case  0.93
F   Male Control  0.22
G   Male    Case  0.96
H   Male    Case  0.79
I Female    Case  0.37
J   Male Control  0.63
K   Male    Case  0.26
L Female Control  0.36
M   Male    Case  0.41
N   Male    Case  0.80
O Female    Case  0.10
P Female Control  0.41
Q Female    Case  0.16
R   Male Control  0.72
S   Male    Case  0.17
T Female    Case  0.74
U   Male Control  0.35
V Female Control  0.77
W   Male Control  0.27
X   Male Control  0.98
Y Female    Case  0.94
Z Female    Case  0.32
> varMetadata(sample.ExpressionSet)
      labelDescription
sex        Female/Male
type      Case/Control
score    Testing Score

pData returns the data.frame describing phenotypes, varMetadata returns the meta-data describing the columns of pData. It's not clear from your example below what variables v1 through v5 are meant to represent, but your 'meta-data'

>> treatment=c("D","192","233","192","233")
>> control=c(1,0,0,0,0)
>> dose=c(NA,30,10,10,0.3)
>> replicate=rep(1,5)

seems really to be meant (when appropriately rearranged) as components of pData. pData and varMetadata are defined to make access to the underlying phenotypic data easy.
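One way to read "appropriately rearranged" is sketched below: the four vectors become columns of pData, with one row per sample. This assumes that v1 through v5 are samples, which the thread does not confirm, and the labelDescription strings are invented for illustration.

```r
library(Biobase)

# The vectors from the original post, read as one value per sample
treatment <- c("D", "192", "233", "192", "233")
control   <- c(1, 0, 0, 0, 0)
dose      <- c(NA, 30, 10, 10, 0.3)
replicate <- rep(1, 5)

# Rows are samples (assumed to be v1..v5), columns are phenotypic variables
pd <- data.frame(treatment, control, dose, replicate,
                 row.names = paste0("v", 1:5))

# varMetadata describes the columns of pd, one row per column
vm <- data.frame(
    labelDescription = c("Treatment identifier",
                         "1 = control, 0 = treated",
                         "Dose level (units not stated in the thread)",
                         "Replicate number"),
    row.names = names(pd))

adf <- new("AnnotatedDataFrame", data = pd, varMetadata = vm)

# Selecting samples is then ordinary row subsetting on the phenotype data;
# which() drops the NA dose of the control sample
sel <- adf[which(adf$dose == 10), ]
sampleNames(sel)
```

With this layout, a request such as "dose equal to 10" selects samples (rows) rather than columns, so no helper operating on varMetadata is needed.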
The actual structure of the object is more accurately represented by the function calls

> adf <- phenoData(sample.ExpressionSet) # phenotype AnnotatedDataFrame
> df <- pData(adf)                       # 'data' part of AnnotatedDataFrame
> md <- varMetadata(adf)                 # meta-data of AnnotatedDataFrame

From your function below, it looks like you're trying to select columns of pData based on their varMetadata. I'm not sure whether there is a strong use case for this, but here's a little example

> pData <- data.frame(X=1:5, Y=5:1, Z=letters[1:5])
> varMetadata <- data.frame(
+     labelDescription=c(
+       "X description", "Y description", "Z description"),
+     metaA=c(TRUE,TRUE,FALSE),
+     metaB=c(TRUE,FALSE,TRUE),
+     metaC=c("yes", "no", "no"))
> adf <- new("AnnotatedDataFrame",
+            data=pData, varMetadata=varMetadata)

For interactive use, I'd probably do something like

> idx <- with(varMetadata(adf), metaA & metaB)
> adf[,idx]
An object of class "AnnotatedDataFrame"
  rowNames: 1, 2, ..., 5  (5 total)
  varLabels and varMetadata description:
    X: X description
  additional varMetadata: metaA, metaB, metaC

(this could be written in a single line, e.g.,

adf[,varMetadata(adf)$metaA & varMetadata(adf)$metaB]

but such brevity is both less efficient and more confusing). 'with' is providing an easy way to access the variables in varMetadata(adf). The second argument to 'with' can be a series of statements of the form,

with(varMetadata(obj), {
    <your statements here...>
})

Your goal seems to be to create complex selection criteria. For this case I find it very useful to stick to the paradigm of constructing logical vectors and using the vectorized logical operators &, | and !.

If you really wanted to make this kind of operation into a function call, I might

> adfMetaSelect <- function(adf, ..., how=all) {
+     dots <- match.call(expand.dots=FALSE)[["..."]]
+     res <- lapply(dots,
+                   function(elt, vm) with(vm, eval(elt)),
+                   vm=varMetadata(adf))
+     idx <- do.call(mapply, c(how, res))
+     adf[,idx]
+ }
> adfMetaSelect(adf, metaA, metaB)
An object of class "AnnotatedDataFrame"
  rowNames: 1, 2, ..., 5  (5 total)
  varLabels and varMetadata description:
    X: X description
  additional varMetadata: metaA, metaB, metaC
> adfMetaSelect(adf, metaA, !metaB)
An object of class "AnnotatedDataFrame"
  rowNames: 1, 2, ..., 5  (5 total)
  varLabels and varMetadata description:
    Y: Y description
  additional varMetadata: metaA, metaB, metaC
> adfMetaSelect(adf, metaC=="no")
An object of class "AnnotatedDataFrame"
  rowNames: 1, 2, ..., 5  (5 total)
  varLabels and varMetadata description:
    Y: Y description
    Z: Z description
  additional varMetadata: metaA, metaB, metaC

The 'how' argument specifies how the logical conditions provided in ... will be combined; in this case all conditions must be true. Probably the mapply could be replaced with how if 'how' were, e.g., get("&").

Perhaps this provides you with some ideas.

Martin

"Eric Lecoutre" <ericlecoutre at gmail.com> writes:

> Hi Martin;
>
> With a little delay, thanks for your detailed answer.
> I took some time to continue my investigations, and things are now clearer
> about what I should do with all those data (mostly, that I have to use the
> phenotypic slot for my data on cell lines).
> There are nearly 100 cell lines used by my client, so it is really worth
> using the ExpressionSet structure for further analysis.
>
> Best wishes,
>
> Eric
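The closing remark in the reply, that a vectorized operator such as get("&") could stand in for the mapply call, might be sketched as follows. The name adfMetaSelect2 and the use of Reduce are assumptions made for this illustration, not code from the thread; the example data repeat the X/Y/Z AnnotatedDataFrame built in the reply.

```r
library(Biobase)

# Hypothetical variant of adfMetaSelect: 'how' is a binary vectorized
# operator (`&` or `|`) folded over the conditions with Reduce()
adfMetaSelect2 <- function(adf, ..., how = `&`) {
    vm <- varMetadata(adf)
    enclos <- parent.frame()
    # Capture the conditions in ... unevaluated
    dots <- match.call(expand.dots = FALSE)[["..."]]
    stopifnot(length(dots) > 0)
    # Evaluate each condition with the varMetadata columns in scope
    res <- lapply(dots, function(elt) eval(elt, vm, enclos))
    # Combine the logical vectors pairwise, element by element
    idx <- Reduce(how, res)
    adf[, idx]
}

pd <- data.frame(X = 1:5, Y = 5:1, Z = letters[1:5])
vm <- data.frame(
    labelDescription = c("X description", "Y description", "Z description"),
    metaA = c(TRUE, TRUE, FALSE),
    metaB = c(TRUE, FALSE, TRUE),
    metaC = c("yes", "no", "no"),
    row.names = names(pd))
adf <- new("AnnotatedDataFrame", data = pd, varMetadata = vm)

both   <- adfMetaSelect2(adf, metaA, metaB)             # metaA AND metaB
either <- adfMetaSelect2(adf, metaA, metaB, how = `|`)  # metaA OR metaB
varLabels(both)
varLabels(either)
```

Reduce accepts any number of conditions, including a single one, so the same function covers calls such as adfMetaSelect2(adf, metaC == "no").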

An embedded and charset-unspecified text was scrubbed... Name: not available Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20071221/8c526fb3/attachment.pl

ADD REPLY • link 16.8 years ago Eric Lecoutre ▴ 40
