justRMA (package Affy) messing up phenoData when passing an AnnotatedDataFrame

0

Entering edit mode

Arnaud Amzallag ▴ 100

@arnaud-amzallag-4471

Last seen 7.1 years ago

Hello to all, I think there is a problem with a certain call of justRMA (see below); it mixes the sample names of pData but keeps the annotation the same, effectively miss labeling samples without returning a single error or warning. > phenoData <- read.AnnotatedDataFrame(paste(prodir, "sample_pheno.txt", sep="/"), sep="\t") > head(pData(phenoData)) Line Resistance Media Condition 28_MGH006_neg.CEL MGH006 Naive D10 Cont 30_MGH006_criz.CEL MGH006 Naive D10 Criz 300 21_MGH010_neg.CEL MGH010 Criz R10 Cont 15_MGH010_Criz.CEL MGH010 Criz R10 Criz 300 9_MGH021_neg.CEL MGH021-2 (pool) Criz (1269 et al) A10 Cont 4_MGH021_LDK.CEL MGH021-2 (pool) Criz (1269 et al) A10 LDK 300 > eset <- justRMA(phenoData=phenoData, celfile.path=celdir) > head(pData(eset)) Line Resistance Media Condition 10_MGH025_neg.CEL MGH006 Naive D10 Cont 11_MGH022_neg.CEL MGH006 Naive D10 Criz 300 12_MGH_049_criz.CEL MGH010 Criz R10 Cont 13_MGH025_criz_0530.CEL MGH010 Criz R10 Criz 300 14_MGH025_LDK.CEL MGH021-2 (pool) Criz (1269 et al) A10 Cont 15_MGH010_Criz.CEL MGH021-2 (pool) Criz (1269 et al) A10 LDK 300 ## IN CONTRAST, THIS WORKS CORRECTLY: eset <- justRMA(phenoData=paste(prodir, "sample_pheno.txt", sep="/"), celfile.path=celdir) I hope people appreciate this might lead to completely wrong results, and that it should be addressed in some way by the authors ? Best regards, Arnaud Amzallag, PhD Center for Cancer Research, Massachusetts General Hospital/Harvard Medical School > sessionInfo() R version 3.0.1 (2013-05-16) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] hgu133plus2cdf_2.12.0 AnnotationDbi_1.22.6 affy_1.38.1 [4] Biobase_2.20.1 BiocGenerics_0.6.0 rj_1.1.3-1 loaded via a namespace (and not attached): [1] BiocInstaller_1.10.4 DBI_0.2-7 IRanges_1.18.4 [4] RSQLite_0.11.4 affyio_1.28.0 preprocessCore_1.22.0 [7] rj.gd_1.1.3-1 stats4_3.0.1 tools_3.0.1 [10] zlibbioc_1.6.0 [[alternative HTML version deleted]]

Annotation Cancer Annotation Cancer • 814 views

ADD COMMENT • link updated 10.1 years ago by James W. MacDonald 65k • written 10.1 years ago by Arnaud Amzallag ▴ 100

0

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 18 hours ago

United States

Hi Arnaud, Thanks for the bug report! This is fixed in the devel version, which will be released in the next week or so. Note that we cannot do much comparisons to ensure that the phenoData row.names make sense. In addition, the row.names for the phenoData object take precedence over filenames when creating the resulting ExpressionSet. Therefore, if we can't do simple matching between the row.names of the phenoData object, we just take you at your word that these are correctly ordered. Here is an example, using some random data: > df x y z Sample_1 1 Low A Sample_2 2 High B Sample_3 3 Low C Sample_4 4 High D Sample_5 5 Low E Sample_6 6 High F > list.celfiles() [1] "GSM38345.CEL.gz" "GSM38346.CEL.gz" "GSM38347.CEL.gz" "GSM38348.CEL.gz" [5] "GSM38349.CEL.gz" "GSM38350.CEL.gz" > eset <- justRMA(phenoData = df) Warning messages: 1: The cel file names are not conformable to the sample names provided in the phenoData object. Note that the sample names take precedence, so will be used in the sampleNames slot of the resulting ExpressionSet. Please ensure that the ordering of the phenoData object matches the order of the file names (use read.celfiles() to test), as no re- ordering is possible. So here, the row.names are Sample_1 through Sample_6, which are obviously very different from the GSM numbers for the celfiles. So we just assume that the sample names we get from row.names(df) are in the same order (and match with) what we get from list.celfiles(). > sampleNames(eset) [1] "Sample_1" "Sample_2" "Sample_3" "Sample_4" "Sample_5" "Sample_6" > pData(phenoData(eset)) x y z Sample_1 1 Low A Sample_2 2 High B Sample_3 3 Low C Sample_4 4 High D Sample_5 5 Low E Sample_6 6 High F So if the phenoData object is ordered incorrectly (e.g., if for instance Sample_1 is actually GSM38346.CEL.gz, or whatever), then the ExpressionSet will be borked as well, but there isn't a way to programmatically figure that out. But if the row.names are conformable with the file names, then we read in the files in the same order: > row.names(df) <- rev(list.celfiles()) > eset2 <- justRMA(phenoData = df) > sampleNames(eset2) [1] "GSM38350.CEL.gz" "GSM38349.CEL.gz" "GSM38348.CEL.gz" "GSM38347.CEL.gz" [5] "GSM38346.CEL.gz" "GSM38345.CEL.gz" > pData(phenoData(eset2)) x y z GSM38350.CEL.gz 1 Low A GSM38349.CEL.gz 2 High B GSM38348.CEL.gz 3 Low C GSM38347.CEL.gz 4 High D GSM38346.CEL.gz 5 Low E GSM38345.CEL.gz 6 High F > all.equal(exprs(eset), exprs(eset2)[,6:1], check.attributes = FALSE) [1] TRUE Best, Jim On 4/2/2014 11:20 AM, Arnaud wrote: > Hello to all, > > I think there is a problem with a certain call of justRMA (see below); it mixes the sample names of pData but keeps the annotation the same, effectively miss labeling samples without returning a single error or warning. > >> phenoData <- read.AnnotatedDataFrame(paste(prodir, "sample_pheno.txt", sep="/"), sep="\t") >> head(pData(phenoData)) > Line Resistance Media Condition > 28_MGH006_neg.CEL MGH006 Naive D10 Cont > 30_MGH006_criz.CEL MGH006 Naive D10 Criz 300 > 21_MGH010_neg.CEL MGH010 Criz R10 Cont > 15_MGH010_Criz.CEL MGH010 Criz R10 Criz 300 > 9_MGH021_neg.CEL MGH021-2 (pool) Criz (1269 et al) A10 Cont > 4_MGH021_LDK.CEL MGH021-2 (pool) Criz (1269 et al) A10 LDK 300 > > >> eset <- justRMA(phenoData=phenoData, celfile.path=celdir) >> head(pData(eset)) > Line Resistance Media Condition > 10_MGH025_neg.CEL MGH006 Naive D10 Cont > 11_MGH022_neg.CEL MGH006 Naive D10 Criz 300 > 12_MGH_049_criz.CEL MGH010 Criz R10 Cont > 13_MGH025_criz_0530.CEL MGH010 Criz R10 Criz 300 > 14_MGH025_LDK.CEL MGH021-2 (pool) Criz (1269 et al) A10 Cont > 15_MGH010_Criz.CEL MGH021-2 (pool) Criz (1269 et al) A10 LDK 300 > > ## IN CONTRAST, THIS WORKS CORRECTLY: > eset <- justRMA(phenoData=paste(prodir, "sample_pheno.txt", sep="/"), celfile.path=celdir) > > > I hope people appreciate this might lead to completely wrong results, and that it should be addressed in some way by the authors ? > > Best regards, > > Arnaud Amzallag, PhD > Center for Cancer Research, Massachusetts General Hospital/Harvard Medical School > >> sessionInfo() > R version 3.0.1 (2013-05-16) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] hgu133plus2cdf_2.12.0 AnnotationDbi_1.22.6 affy_1.38.1 > [4] Biobase_2.20.1 BiocGenerics_0.6.0 rj_1.1.3-1 > > loaded via a namespace (and not attached): > [1] BiocInstaller_1.10.4 DBI_0.2-7 IRanges_1.18.4 > [4] RSQLite_0.11.4 affyio_1.28.0 preprocessCore_1.22.0 > [7] rj.gd_1.1.3-1 stats4_3.0.1 tools_3.0.1 > [10] zlibbioc_1.6.0 > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099

ADD COMMENT • link 10.1 years ago James W. MacDonald 65k

Login before adding your answer.