hgu133a annotation and discontinued Entrez IDs


David Fermin ▴ 20

@david-fermin-2845

Last seen 9.7 years ago

I discovered a problem when annotating some expression analysis results. I am using CEL files from the HG-U133A array and, as an initial step after creating an rmaset, I filter out the probesets without Entrez Gene IDs as follows:

    library(affy)     # ReadAffy(), rma()
    library(hgu133a)  # hgu133aENTREZID annotation environment

    arrayset <- ReadAffy()
    rmaset <- rma(arrayset)
    entrezIds <- mget(featureNames(rmaset), envir = hgu133aENTREZID)
    haveEntrezId <- names(entrezIds)[sapply(entrezIds, function(x) !is.na(x))]
    numNoEntrezId <- length(featureNames(rmaset)) - length(haveEntrezId)
    rmaset <- rmaset[haveEntrezId, ]

After doing my limma analysis I use aafTableAnn() to grab the data. I expected to get a list of probesets that all have annotation information. However, when I manually scanned the annotated table I discovered a number of probesets with an Entrez ID but no other annotation. Most of these, it turns out, were discontinued as of a 2005 build of Entrez; some were mapped to other IDs and some were dropped altogether.

See sessionInfo() below for the versions I am using. When I query "?hgu133a" I get the following:

    hgu133a {hgu133a}

    The annotation package was built using a downloadable R package - AnnBuilder (download and build your own) from www.bioconductor.org using the following public data sources:

    Entrez Gene:ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/. Built: Source data downloaded from Entrez Gene on Fri Aug 24 18:20:19 2007...

    [Package hgu133a version 2.0.1 Index]

My question is, why am I getting probesets with discontinued Entrez IDs?

Thank you for your help.

Best Regards,
David

    > sessionInfo()
    R version 2.6.2 (2008-02-08)
    i386-apple-darwin8.10.1

    locale:
    en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

    attached base packages:
    [1] splines   tools     stats     graphics  grDevices utils     datasets  methods
    [9] base

    other attached packages:
     [1] GOstats_2.4.0       Category_2.4.0      RBGL_1.14.0         GO.db_2.0.2
     [5] graph_1.16.1        genefilter_1.16.0   survival_2.34       annotate_1.16.1
     [9] xtable_1.5-2        AnnotationDbi_1.0.6 RSQLite_0.6-8       DBI_0.2-4
    [13] ALL_1.4.3           hgu133a_2.0.1       annaffy_1.10.1      KEGG_2.0.1
    [17] GO_2.0.1            limma_2.12.0        affy_1.16.0         preprocessCore_1.0.0
    [21] affyio_1.6.1        Biobase_1.16.3

    loaded via a namespace (and not attached):
    [1] cluster_1.11.9
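The manual scan described above (probesets that have an Entrez ID but no other annotation) could be automated along these lines. This is only a sketch in base R on toy data: the two lists stand in for what mget() would return from hgu133aENTREZID and from a second mapping such as hgu133aSYMBOL, and every probe name, ID and symbol in it is hypothetical, as is the helper hasValue().

```r
# Toy stand-ins for mget(probes, envir = hgu133aENTREZID) and
# mget(probes, envir = hgu133aSYMBOL); all names and values are hypothetical.
entrezIds <- list("probe_1_at" = "1001", "probe_2_at" = NA_character_,
                  "probe_3_at" = "1003", "probe_4_at" = "1004")
symbols   <- list("probe_1_at" = "GENEA", "probe_2_at" = NA_character_,
                  "probe_3_at" = NA_character_, "probe_4_at" = "GENED")

# Hypothetical helper: TRUE when a probeset has at least one non-missing value.
hasValue <- function(mapping) {
  vapply(mapping, function(x) any(!is.na(x)), logical(1))
}

# Probesets that carry an Entrez ID.
haveEntrezId <- names(entrezIds)[hasValue(entrezIds)]

# Of those, the ones with no symbol: candidates for discontinued IDs.
suspect <- haveEntrezId[!hasValue(symbols[haveEntrezId])]
print(suspect)
```

Using any(!is.na(x)) inside vapply() keeps the result a plain logical vector even if a probeset maps to more than one ID.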

Annotation hgu133a limma • 1.3k views

ADD COMMENT • link 15.9 years ago David Fermin ▴ 20


anna freni sterrantino ▴ 120

@anna-freni-sterrantino-2847

Last seen 9.7 years ago

Hi David,

It could be that you should load

    library(hgu133a.db)

From your sessionInfo() it seems that you have the older version. You should also update R to 2.7 and get the Bioconductor release 2.2 packages.

Cheers,
A.

----- Original message -----
From: David Fermin
To: bioconductor@stat.math.ethz.ch
Sent: Thursday, 12 June 2008, 4:56:51
Subject: [BioC] hgu133a annotation and discontinued Entrez IDs

ADD COMMENT • link 15.9 years ago anna freni sterrantino ▴ 120


David Fermin ▴ 20

@david-fermin-2845

Last seen 9.7 years ago

anna freni sterrantino writes:

> Hi David,
> could be that you should get
> library(hgu133a.db)
> from your sessionInfo it seems that you have the older version.
> You should also update your R to 2.7 and get the Bioc package Release 2.2.
>
> Cheers
>
> A.

Thank you for your reply. I know there are newer versions available, but for the manuscript I'm preparing I would like to keep package versions consistent across my analyses.

The question remains: why am I getting probesets with antiquated Entrez IDs when the annotation package I used (see original message), built in August 2007, should be recent enough to exclude these discontinued IDs?

David
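For keeping versions consistent across analyses, one option is to save the session details next to the results. The sketch below uses only functions from R's utils package; the output file name and the choice of the "stats" package are arbitrary assumptions made so that it runs anywhere.

```r
# Save the exact R and package versions alongside the analysis outputs.
# The file location is an arbitrary choice for this sketch.
versionFile <- file.path(tempdir(), "sessionInfo.txt")
writeLines(capture.output(sessionInfo()), versionFile)

# Record the version of one specific package; "stats" ships with R and is
# used here only so the sketch runs without extra installs.
pkgVersion <- as.character(packageVersion("stats"))
cat("stats version:", pkgVersion, "\n")
```

In a real analysis the package name would be the annotation package in use, for example "hgu133a".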

ADD COMMENT • link 15.9 years ago David Fermin ▴ 20


David Fermin wrote:

> Thank you for your reply. I know there are newer versions available, but for the
> manuscript I'm preparing I would like to be consistent with version control
> amongst my analyses.
>
> The question remains, why am I getting probesets with antiquated Entrez IDs
> when the annotation package I used (see original message) from August 2007
> should be adequately up to date so as to exclude these discontinued IDs?

As you noted in your first post, the help page for this package says:

    The annotation package was built using a downloadable R package - AnnBuilder (download and build your own) from www.bioconductor.org using the following public data sources:

    Entrez Gene:ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/. Built: Source data downloaded from Entrez Gene on Fri Aug 24 18:20:19 2007...

In other words, we downloaded the data from Entrez Gene on that day and parsed it from there, so any discontinued Entrez Gene IDs in that package were still in the data we downloaded and parsed.

Our purpose here is simply to make annotation data available in a convenient form, not to check the data made available by NCBI and ensure it is correct (that is NCBI's job), so my recommendation would be to inform them of these errors.

Best,
Jim

--
James W. MacDonald, M.S.
Biostatistician
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623
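Since the annotation package passes through whatever NCBI published on the download date, a user who wants to flag discontinued IDs has to check them separately. The sketch below assumes a table shaped like NCBI's gene_history file, which to my understanding sits in the same Entrez Gene FTP directory and has GeneID and Discontinued_GeneID columns, with "-" marking an ID that has no replacement. The table here is toy data and every ID and probe name is hypothetical; check the column names against the real file before relying on this.

```r
# Toy stand-in for NCBI's gene_history table; every ID below is hypothetical.
# "-" in GeneID means the discontinued ID has no replacement (assumed layout).
geneHistory <- data.frame(
  GeneID              = c("2001", "-"),
  Discontinued_GeneID = c("1003", "1004"),
  stringsAsFactors    = FALSE
)

# Hypothetical probeset-to-Entrez mapping as a named character vector.
probeToEntrez <- c(probe_1_at = "1001", probe_3_at = "1003", probe_4_at = "1004")

# Look up each ID in the discontinued column; NA means the ID is still current.
hit <- match(probeToEntrez, geneHistory$Discontinued_GeneID)
replacement <- geneHistory$GeneID[hit]

status <- data.frame(
  probe        = names(probeToEntrez),
  entrez       = unname(probeToEntrez),
  discontinued = !is.na(hit),
  replacedBy   = ifelse(is.na(hit) | replacement == "-", NA_character_, replacement),
  stringsAsFactors = FALSE
)
print(status)
```

Probesets flagged as discontinued can then be dropped or remapped to the replacement ID, depending on what the analysis needs.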

ADD REPLY • link 15.9 years ago James W. MacDonald 65k
