question subsetting expressionSet

Guido Hooiveld

Hi,

Just to confirm I am doing things properly:
I have created an ExpressionSet from a set of 120 Affymetrix arrays and I also added some metadata (phenoData) to that ExpressionSet. This all goes OK. Now I would like to subset the ExpressionSet based on one of the variables described in the phenoData.
Although I am able to 'extract' the proper arrays, I noticed something unexpected when looking at the phenoData of the new, subset object: the phenoData variable that was used to subset *seems* to still have 3 levels, whereas I expect only one level. This behaviour also occurs for the other variables of the phenoData data frame (i.e. more levels are reported than are present). To be sure, can anyone explain whether this is to be expected, or whether I am doing something wrong?

Thanks,
Guido

# read data & normalize
> library(affyPLM)
> pheno <- read.delim(file="A213_metadata.txt", row.names=1)
> affy.data <- ReadAffy(cdfname="mogene11stv1mmentrezg", phenoData=as.data.frame(pheno))
>
> # check
> validObject(affy.data)
[1] TRUE
>
> # normalize
> x.norm <- fitPLM(affy.data)
> # convert PLMset to eSet!
> x.norm <- pset2eset(x.norm)
> # check
> validObject(x.norm)
[1] TRUE
>
> x.norm
ExpressionSet (storageMode: lockedEnvironment)
assayData: 21225 features, 120 samples
  element names: exprs, se.exprs
protocolData: none
phenoData
  sampleNames: G014_A05_01_801_I1_chow.CEL G014_A07_09_809_I1_HF.CEL
    ... G020_H12_120_824_I10_HF.CEL (120 total)
  varLabels: Simil Diet ... Labeling (5 total)
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
Annotation: mogene11stv1mmentrezg
> dim(x.norm)
Features Samples
21225 120
>
> # Check Diet assignment
> x.norm$Diet
  [1] chow hfd lfd chow hfd lfd chow hfd lfd chow hfd lfd lfd chow hfd
 [16] lfd chow hfd lfd chow hfd lfd chow hfd chow chow chow chow lfd lfd
 [31] lfd lfd hfd hfd hfd hfd chow chow chow chow lfd lfd lfd lfd hfd
 [46] hfd hfd hfd chow chow chow chow lfd lfd lfd lfd hfd hfd hfd hfd
 [61] chow chow chow chow lfd lfd lfd lfd hfd hfd hfd hfd chow chow chow
 [76] chow lfd lfd lfd lfd hfd hfd hfd hfd chow chow chow chow lfd lfd
 [91] lfd lfd hfd hfd hfd hfd chow chow chow chow lfd lfd lfd lfd hfd
[106] hfd hfd hfd chow chow chow chow lfd lfd lfd lfd hfd hfd hfd hfd
Levels: chow hfd lfd
>
> str(x.norm)
<<snip>
  ..@ phenoData :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
  .. .. ..@ varMetadata :'data.frame': 5 obs. of 1 variable:
  .. .. .. ..$ labelDescription: chr [1:5] NA NA NA NA ...
  .. .. ..@ data :'data.frame': 120 obs. of 5 variables:
  .. .. .. ..$ Simil : Factor w/ 10 levels "i1","i10","i2",..: 1 1 3 1 1 3 1 1 3 1 ...
  .. .. .. ..$ Diet : Factor w/ 3 levels "chow","hfd","lfd": 1 2 3 1 2 3 1 2 3 1 ...
  .. .. .. ..$ Group : Factor w/ 30 levels "i10_chow","i10_hfd",..: 4 5 9 4 5 9 4 5 9 4 ...
  .. .. .. ..$ Plate : Factor w/ 2 levels "G014","G020": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. ..$ Labeling: int [1:120] 3 1 2 1 1 2 1 1 2 1 ...

So far, so good.
Now I would like to extract the data of only the 40 chow samples by subsetting x.norm on the variable 'Diet'.

> # backup x.norm
> x.norm2 <- x.norm
>
> # subset only chow samples
> x.norm <- x.norm2[,x.norm2$Diet=="chow"]
> dim(x.norm)
Features Samples
21225 40

Subsetting samples seems to go OK...

> # Again check Diet assignment
> x.norm$Diet
 [1] chow chow chow chow chow chow chow chow chow chow chow chow chow chow chow
[16] chow chow chow chow chow chow chow chow chow chow chow chow chow chow chow
[31] chow chow chow chow chow chow chow chow chow chow
Levels: chow hfd lfd
>

^^ why are there still 3 levels? I expected only one level, namely "chow".

> str(x.norm)
<<snip>
  ..@ phenoData :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
  .. .. ..@ varMetadata :'data.frame': 5 obs. of 1 variable:
  .. .. .. ..$ labelDescription: chr [1:5] NA NA NA NA ...
  .. .. ..@ data :'data.frame': 40 obs. of 5 variables:
  .. .. .. ..$ Simil : Factor w/ 10 levels "i1","i10","i2",..: 1 1 1 1 3 3 3 3 4 4 ...
  .. .. .. ..$ Diet : Factor w/ 3 levels "chow","hfd","lfd": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. ..$ Group : Factor w/ 30 levels "i10_chow","i10_hfd",..: 4 4 4 4 7 7 7 7 10 10 ...
  .. .. .. ..$ Plate : Factor w/ 2 levels "G014","G020": 1 1 1 1 1 1 1 1 2 2 ...
  .. .. .. ..$ Labeling: int [1:40] 3 1 1 1 2 2 2 2 3 3 ...

^^ idem: why are the 'original' levels reported for all variables, and not the levels of the subset?

> sessionInfo()
R version 2.14.0 (2011-10-31)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C LC_NAME=C
 [9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
 [1] SpeCond_1.8.0 RColorBrewer_1.0-5
 [3] hwriter_1.3 fields_6.6.3
 [5] spam_0.27-0 mclust_3.4.11
 [7] mogene11stv1mmentrezgcdf_14.1.0 affyPLM_1.30.0
 [9] preprocessCore_1.16.0 gcrma_2.26.0
[11] affy_1.33.2 Biobase_2.14.0
[13] BiocGenerics_0.1.3

loaded via a namespace (and not attached):
[1] affyio_1.22.0 BiocInstaller_1.2.1 Biostrings_2.22.0
[4] IRanges_1.12.5 splines_2.14.0 tools_2.14.0
[7] zlibbioc_1.0.0
>

---------------------------------------------------------
Guido Hooiveld, PhD
Nutrition, Metabolism & Genomics Group
Division of Human Nutrition
Wageningen University
Biotechnion, Bomenweg 2
NL-6703 HD Wageningen
the Netherlands
tel: (+)31 317 485788
fax: (+)31 317 483342
email: guido.hooiveld@wur.nl
internet: http://nutrigene.4t.com
http://scholar.google.com/citations?user=qFHaMnoAAAAJ
http://www.researcherid.com/rid/F-4912-2010
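
The behaviour described in this question does not depend on the arrays or on the ExpressionSet class; it can probably be reproduced with an ordinary data frame. The sketch below is only an illustration: `pheno` and its contents are made-up stand-ins for the real metadata file, and `table()` is used as one way to see which levels are actually populated.

```r
# Toy stand-in for the phenoData table (hypothetical values, not the real metadata)
pheno <- data.frame(
  Diet  = factor(rep(c("chow", "hfd", "lfd"), times = 4)),
  Plate = factor(rep(c("G014", "G020"), each = 6))
)

# Subset rows the same way the ExpressionSet columns were subset
chow <- pheno[pheno$Diet == "chow", ]

nrow(chow)                       # number of rows kept
nlevels(chow$Diet)               # levels still recorded on the factor
unique(as.character(chow$Diet))  # values actually present
table(chow$Diet)                 # unused levels appear with a count of zero
```

Comparing `nlevels()` with the values actually present is a quick check that the subset itself is correct and that only the level bookkeeping differs from what was expected.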


updated 14.0 years ago by Oosting, J. PATH • written 14.0 years ago by Guido Hooiveld

Martin Morgan

On 01/11/2012 02:07 AM, Hooiveld, Guido wrote:
> [...] the phenoData slot that has been used to subset *seems* to still have 3 levels, whereas I expect only one level. This behaviour also occurs for the other variables of the phenoData dataframe (i.e. more levels are reported than are present). To be sure, can anyone explain if this is to be expected, or whether I do something wrong?

Hi Guido, this is how R factors work:

  > g = f[f=="M"]
  > g
  [1] M
  Levels: F M

You could recast the factor, e.g.,

  > factor(g)
  [1] M
  Levels: M

or be satisfied that R is keeping track of important aspects of your original data.

(In your script, x.norm2 <- x.norm[,x.norm$Diet=="chow"] might have been more natural, rather than making a copy and subsetting the copy.)

Hope that helps,

Martin

--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793
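
The snippet in this reply uses a factor `f` that is not defined in the message. A self-contained version might look like the sketch below, where `f` and the data frame `pd` are invented for illustration. It also shows `droplevels()`, a base R function that was not mentioned in the thread but does the same job as recasting with `factor()`, and can be applied to every factor column of a data frame at once.

```r
# Hypothetical factor standing in for the 'f' used in the reply
f <- factor(c("M", "F", "M", "F"))

g <- f[f == "M"]
levels(g)                         # both levels are kept after subsetting

levels(factor(g))                 # recasting drops the unused level
levels(droplevels(g))             # same effect, stated more explicitly
levels(f[f == "M", drop = TRUE])  # drop unused levels while subsetting

# droplevels() also accepts a data frame and only touches its factor columns
pd <- data.frame(
  Diet     = factor(c("chow", "hfd", "lfd", "chow")),
  Labeling = c(3L, 1L, 2L, 1L)
)
pd.chow <- droplevels(pd[pd$Diet == "chow", ])
str(pd.chow)

# Assumption: with Biobase loaded, the same idea applied to the subset
# ExpressionSet from the question would be
#   pData(x.norm) <- droplevels(pData(x.norm))
```

Whether to drop the levels at all is a judgment call: keeping them preserves the full design of the original experiment, while dropping them avoids empty groups in tables, plots, and model matrices built from the subset.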

14.0 years ago • Martin Morgan

Oosting, J. PATH

An alternative would be to read in your phenodata with as.is=TRUE. Character columns in the data frame will then stay character vectors instead of being converted to factors. You can generate the factors when you need them, i.e. when constructing a model for the analysis.

pheno <- read.delim(file="A213_metadata.txt", row.names=1, as.is=TRUE)

Jan

I think the as.data.frame() is superfluous: the result of read.delim is already a data frame.

> > # read data & normalize
> > library(affyPLM)
> > pheno <- read.delim(file="A213_metadata.txt", row.names=1)
> > affy.data <- ReadAffy(cdfname="mogene11stv1mmentrezg", phenoData=as.data.frame(pheno))
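
A minimal sketch of the `as.is=TRUE` approach, using an inline table in place of A213_metadata.txt (the sample names and values are invented for illustration). The text columns arrive as character vectors, and a factor built after subsetting only carries the levels that are present. Note that from R 4.0.0 onward `read.delim` no longer converts strings to factors by default, so `as.is=TRUE` mainly matters on older versions such as the R 2.14.0 used in this thread.

```r
# Inline stand-in for the metadata file (hypothetical contents, tab-separated)
txt <- "Sample\tDiet\tLabeling
s1\tchow\t3
s2\thfd\t1
s3\tlfd\t2
s4\tchow\t1"

pheno <- read.delim(text = txt, row.names = 1, as.is = TRUE)
class(pheno$Diet)   # character rather than factor

# Subset first, then build the factor only where a model needs it
chow <- pheno[pheno$Diet == "chow", ]
diet <- factor(chow$Diet)
levels(diet)
```

One trade-off with this design is that the level order is decided each time `factor()` is called, so a reference level that matters for a model may need to be set explicitly at that point.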

14.0 years ago • Oosting, J. PATH
