Question: question subsetting expressionSet
7.8 years ago by
Guido Hooiveld2.5k
Wageningen University, Wageningen, the Netherlands
Guido Hooiveld2.5k wrote:
Hi,

Just to confirm I am doing things properly: I have created an ExpressionSet from a set of 120 Affymetrix arrays and I also added some metadata (phenoData) to that ExpressionSet. This all goes OK. Now I would like to subset the ExpressionSet based on one of the variables described in the phenoData.

Although I am able to 'extract' the proper arrays, I noticed something unexpected when looking at the phenoData of the new, subset object: the phenoData variable that was used for subsetting *seems* to still have 3 levels, whereas I expected only one level. This behaviour also occurs for the other variables of the phenoData data frame (i.e. more levels are reported than are present). To be sure, can anyone explain whether this is to be expected, or whether I am doing something wrong?

Thanks,
Guido

# read data & normalize
> library(affyPLM)
> pheno <- read.delim(file="A213_metadata.txt", row.names=1)
> affy.data <- ReadAffy(cdfname="mogene11stv1mmentrezg", phenoData=as.data.frame(pheno))
>
> # check
> validObject(affy.data)
[1] TRUE
>
> # normalize
> x.norm <- fitPLM(affy.data)
> # convert PLMset to eSet!
> x.norm <- pset2eset(x.norm)
> # check
> validObject(x.norm)
[1] TRUE
>
> x.norm
ExpressionSet (storageMode: lockedEnvironment)
assayData: 21225 features, 120 samples
  element names: exprs, se.exprs
protocolData: none
phenoData
  sampleNames: G014_A05_01_801_I1_chow.CEL G014_A07_09_809_I1_HF.CEL
    ... G020_H12_120_824_I10_HF.CEL (120 total)
  varLabels: Simil Diet ... Labeling (5 total)
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
Annotation: mogene11stv1mmentrezg
> dim(x.norm)
Features  Samples
   21225      120
>
> # Check Diet assignment
> x.norm$Diet
  [1] chow hfd  lfd  chow hfd  lfd  chow hfd  lfd  chow hfd  lfd  lfd  chow hfd
 [16] lfd  chow hfd  lfd  chow hfd  lfd  chow hfd  chow chow chow chow lfd  lfd
 [31] lfd  lfd  hfd  hfd  hfd  hfd  chow chow chow chow lfd  lfd  lfd  lfd  hfd
 [46] hfd  hfd  hfd  chow chow chow chow lfd  lfd  lfd  lfd  hfd  hfd  hfd  hfd
 [61] chow chow chow chow lfd  lfd  lfd  lfd  hfd  hfd  hfd  hfd  chow chow chow
 [76] chow lfd  lfd  lfd  lfd  hfd  hfd  hfd  hfd  chow chow chow chow lfd  lfd
 [91] lfd  lfd  hfd  hfd  hfd  hfd  chow chow chow chow lfd  lfd  lfd  lfd  hfd
[106] hfd  hfd  hfd  chow chow chow chow lfd  lfd  lfd  lfd  hfd  hfd  hfd  hfd
Levels: chow hfd lfd
>
> str(x.norm)
<<snip>
  ..@ phenoData :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
  .. .. ..@ varMetadata :'data.frame': 5 obs. of 1 variable:
  .. .. .. ..$ labelDescription: chr [1:5] NA NA NA NA ...
  .. .. ..@ data :'data.frame': 120 obs. of 5 variables:
  .. .. .. ..$ Simil   : Factor w/ 10 levels "i1","i10","i2",..: 1 1 3 1 1 3 1 1 3 1 ...
  .. .. .. ..$ Diet    : Factor w/ 3 levels "chow","hfd","lfd": 1 2 3 1 2 3 1 2 3 1 ...
  .. .. .. ..$ Group   : Factor w/ 30 levels "i10_chow","i10_hfd",..: 4 5 9 4 5 9 4 5 9 4 ...
  .. .. .. ..$ Plate   : Factor w/ 2 levels "G014","G020": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. ..$ Labeling: int [1:120] 3 1 2 1 1 2 1 1 2 1 ...

So far, so good. Now I would like to extract the data of only the 40 chow samples by subsetting x.norm on the variable 'Diet'.

> # backup x.norm
> x.norm2 <- x.norm
>
> # subset only chow samples
> x.norm <- x.norm2[,x.norm2$Diet=="chow"]
> dim(x.norm)
Features  Samples
   21225       40

Subsetting the samples seems to go OK...

> # Again check Diet assignment
> x.norm$Diet
 [1] chow chow chow chow chow chow chow chow chow chow chow chow chow chow chow
[16] chow chow chow chow chow chow chow chow chow chow chow chow chow chow chow
[31] chow chow chow chow chow chow chow chow chow chow
Levels: chow hfd lfd
>

^^ Why are there still 3 levels? I expected only one level, namely "chow".

> str(x.norm)
<<snip>
  ..@ phenoData :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
  .. .. ..@ varMetadata :'data.frame': 5 obs. of 1 variable:
  .. .. .. ..$ labelDescription: chr [1:5] NA NA NA NA ...
  .. .. ..@ data :'data.frame': 40 obs. of 5 variables:
  .. .. .. ..$ Simil   : Factor w/ 10 levels "i1","i10","i2",..: 1 1 1 1 3 3 3 3 4 4 ...
  .. .. .. ..$ Diet    : Factor w/ 3 levels "chow","hfd","lfd": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. ..$ Group   : Factor w/ 30 levels "i10_chow","i10_hfd",..: 4 4 4 4 7 7 7 7 10 10 ...
  .. .. .. ..$ Plate   : Factor w/ 2 levels "G014","G020": 1 1 1 1 1 1 1 1 2 2 ...
  .. .. .. ..$ Labeling: int [1:40] 3 1 1 1 2 2 2 2 3 3 ...

^^ Idem: why are the 'original' levels reported for all variables, and not the levels present in the subset?

> sessionInfo()
R version 2.14.0 (2011-10-31)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
 [1] SpeCond_1.8.0                   RColorBrewer_1.0-5
 [3] hwriter_1.3                     fields_6.6.3
 [5] spam_0.27-0                     mclust_3.4.11
 [7] mogene11stv1mmentrezgcdf_14.1.0 affyPLM_1.30.0
 [9] preprocessCore_1.16.0           gcrma_2.26.0
[11] affy_1.33.2                     Biobase_2.14.0
[13] BiocGenerics_0.1.3

loaded via a namespace (and not attached):
[1] affyio_1.22.0  BiocInstaller_1.2.1 Biostrings_2.22.0
[4] IRanges_1.12.5 splines_2.14.0      tools_2.14.0
[7] zlibbioc_1.0.0

---------------------------------------------------------
Guido Hooiveld, PhD
Nutrition, Metabolism & Genomics Group
Division of Human Nutrition
Wageningen University
Biotechnion, Bomenweg 2
NL-6703 HD Wageningen
the Netherlands
tel: (+)31 317 485788
fax: (+)31 317 483342
email: guido.hooiveld@wur.nl
internet: http://nutrigene.4t.com
http://scholar.google.com/citations?user=qFHaMnoAAAAJ
http://www.researcherid.com/rid/F-4912-2010
Answer: question subsetting expressionSet
7.8 years ago by
Martin Morgan ♦♦ 23k
United States
Martin Morgan ♦♦ 23k wrote:
On 01/11/2012 02:07 AM, Hooiveld, Guido wrote:
> Hi,
>
> Just to confirm I am doing things properly:
> I have created an expressionSet from a set of 120 Affymetrix arrays and I also added some metadata (phenoData) to that expressionSet. This all goes OK. Now I would like to subset the expressionSet based on one of the variables described in the phenoData.
> Although I am able to 'extract' the proper arrays, I noticed something unexpected when looking at the phenoData of the new, subset object; the phenoData slot that has been used to subset *seems* to still have 3 levels, whereas I expect only one level. This behaviour also occurs for the other variables of the phenoData dataframe (i.e. more levels are reported than are present). To be sure, can anyone explain if this is to be expected, or whether I do something wrong?

Hi Guido -- this is how R factors work: subsetting a factor keeps all of its original levels, including the ones that no longer occur.

> f = factor(c("F", "M"))   # assumed example: any factor with levels F and M
> g = f[f=="M"]
> g
[1] M
Levels: F M

You could recast the factor, e.g.,

> factor(g)
[1] M
Levels: M

or be satisfied that R is keeping track of important aspects of your original data.

(In your script, x.norm2 <- x.norm[,x.norm$Diet=="chow"] might have been more natural, rather than making a copy and subsetting the copy.)

Hope that helps,
Martin

--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793
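To make the factor behaviour described in this answer concrete, here is a minimal sketch in base R. The small `pheno` data frame is a hypothetical stand-in for the phenoData table in the question (it is not the real metadata), and `droplevels()` is a base R function that was not mentioned in the thread; it is shown here as one possible way to recast every factor column at once.

```r
# Hypothetical stand-in for the phenoData table: two factor columns, six samples.
pheno <- data.frame(
  Diet  = factor(c("chow", "hfd", "lfd", "chow", "hfd", "lfd")),
  Plate = factor(c("G014", "G014", "G014", "G020", "G020", "G020"))
)

# Subsetting rows keeps every original level, whether it is still used or not.
chow <- pheno[pheno$Diet == "chow", ]
levels(chow$Diet)

# Recast a single factor so that only the levels actually present remain.
diet_only <- factor(chow$Diet)
levels(diet_only)

# Or drop the unused levels from all factor columns in one call.
chow_clean <- droplevels(chow)
levels(chow_clean$Diet)

# Assumption, not run here: for an ExpressionSet the same idea would be
#   pData(x.norm) <- droplevels(pData(x.norm))
```

Whether to drop the levels is a judgment call: keeping them preserves the original design (useful for consistent colours or contrasts across subsets), while dropping them avoids empty groups in tables and model matrices.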
Answer: question subsetting expressionSet
7.8 years ago by
Oosting, J. PATH550 wrote:
An alternative would be to read in your phenoData with as.is=TRUE. Then the text columns in the data frame stay plain character vectors instead of being converted to factors. You can generate the factors when you need them, i.e. when constructing a model for the analysis.

pheno <- read.delim(file="A213_metadata.txt", row.names=1, as.is=TRUE)

Jan

I think the as.data.frame() is superfluous; the result of read.delim is already a data frame.

> > # read data & normalize
> > library(affyPLM)
> > pheno <- read.delim(file="A213_metadata.txt", row.names=1)
> > affy.data <- ReadAffy(cdfname="mogene11stv1mmentrezg", phenoData=as.data.frame(pheno))
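A small sketch of the as.is=TRUE approach, for illustration only. The three-sample table below is made up and inlined through the `text` argument of `read.delim()` so that the example runs without the real A213_metadata.txt; the sample names and values are assumptions.

```r
# Hypothetical tab-delimited metadata, inlined so the sketch is self-contained.
txt <- "Sample\tDiet\tLabeling\ns1.CEL\tchow\t3\ns2.CEL\thfd\t1\ns3.CEL\tchow\t2"

# as.is = TRUE leaves text columns as character vectors (no factors are created).
pheno <- read.delim(text = txt, row.names = 1, as.is = TRUE)
class(pheno$Diet)

# Subsetting a character column carries no leftover levels along.
chow <- pheno[pheno$Diet == "chow", ]

# Build the factor only when it is needed; its levels come from the values present.
diet <- factor(chow$Diet)
levels(diet)
```

The trade-off is that the level order has to be set explicitly (for example with the `levels` argument of `factor()`) whenever a particular reference level matters in a model.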