Unambiguous mapping of affy IDs to gene symbols using hgu133plus2.db


Christian Ruckert ▴ 170

@christian-ruckert-3294

Last seen 4.9 years ago

Germany

Hi,

I am doing some mapping of Affymetrix probeset IDs to gene symbols using the package hgu133plus2.db.

As the following example illustrates, each of the 40686 mapped probesets maps to exactly one gene symbol.

> library("hgu133plus2.db")
> x <- hgu133plus2SYMBOL
> Llength(x)
[1] 54675
> count.mappedkeys(x)
[1] 40686

> head(nhit(x))
1007_s_at   1053_at    117_at    121_at 1255_g_at   1294_at
        1         1         1         1         1         1

> table(nhit(x))

    0     1
13989 40686

Am I correct that a gene symbol annotation is only included in the package if it is unambiguous?

For example:

> x[["203074_at"]]
[1] NA

But netaffx and biomart return: ANXA8, ANXA8L1, ANXA8L2

If doing a mapping between protein and gene expression arrays based on gene symbols, can results be improved by using biomart instead of the annotation packages?

Christian

> sessionInfo()
R version 2.11.0 (2010-04-22)
x86_64-pc-linux-gnu

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] hgu133plus2.db_2.4.1 org.Hs.eg.db_2.4.1   RSQLite_0.9-1
[4] DBI_0.2-5            AnnotationDbi_1.10.1 Biobase_2.8.0

loaded via a namespace (and not attached):
[1] tools_2.11.0

Annotation hgu133plus2 biomaRt • 3.6k views

Updated 13.6 years ago by Marc Carlson ★ 7.2k • written 13.6 years ago by Christian Ruckert ▴ 170


Marc Carlson ★ 7.2k

@marc-carlson-2264

Last seen 7.7 years ago

United States

Hi Christian,

What appears to be a simple mapping from probesets to gene symbols is actually slightly more complex. Behind the scenes, the annotation package holds two relationships: one from probesets to gene IDs, and one from gene IDs to gene symbols. This matters because many probesets can map to a single gene, many genes can map to a single probeset, and many gene symbols can map to a single gene. So the first relationship is potentially many (probesets) to many (genes), and the second is many (symbols) to one (gene).

Why then does it look simpler than that? By default, the annotation packages hide probesets that map to more than one gene, because most of the time you probably don't want anything to do with probes that are not specific. On the off chance that you really do want to see them, you can expose them using the toggleProbes() method. So the first relationship is usually presented as many (probesets) to one (gene).

In the SYMBOL mappings, the only gene symbol we expose is the most standard one. If you want the other gene symbols that are associated with a particular Entrez Gene ID, you would have to use the ALIAS2PROBE mapping. So this second relationship is also normally simplified for you, from "many to one" down to "one to one".

Because gene symbols are not guaranteed to be unique (sometimes the same symbol is used as an alias for several different genes), I would strongly urge you to NEVER use them as actual IDs. If you have to use them, they should always be the last piece of data attached in a workflow. So whether you decide to use the annotation packages or biomaRt, you will need a strategy for matching up IDs that does not rely on gene symbols. In short, any sort of "joining" operation that uses gene symbols as keys is unsafe and should never be done.
Marc
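To make the warning about joining on gene symbols concrete, here is a minimal base R sketch. The two tables, the IDs, and the symbols are all made up for illustration (they are not real genes); the only assumption is that one symbol happens to be attached to two different gene IDs, as can happen with aliases.

```r
# Toy tables with invented IDs and symbols (hypothetical data, not real genes).
# The symbol "ABC1" is attached to two different gene IDs on purpose.
expr <- data.frame(
  entrez     = c("1001", "1002", "1003"),
  symbol     = c("ABC1", "XYZ2", "ABC1"),
  expr_value = c(5.2, 7.1, 3.3),
  stringsAsFactors = FALSE
)
prot <- data.frame(
  entrez     = c("1001", "1002", "1003"),
  symbol     = c("ABC1", "XYZ2", "ABC1"),
  prot_value = c(0.8, 1.4, 2.0),
  stringsAsFactors = FALSE
)

# Joining on the symbol pairs every "ABC1" row with every other "ABC1" row,
# so measurements from different genes get matched to each other.
by_symbol <- merge(expr, prot, by = "symbol")

# Joining on the gene ID keeps one row per gene.
by_entrez <- merge(expr[, c("entrez", "expr_value")],
                   prot[, c("entrez", "prot_value")],
                   by = "entrez")

# Rows of the symbol join where the two sides are actually different genes.
mismatched <- sum(by_symbol$entrez.x != by_symbol$entrez.y)

nrow(by_symbol)   # 5 rows
nrow(by_entrez)   # 3 rows
mismatched        # 2 rows pair different genes
```

In this toy case the symbol join silently produces two rows that combine values from different genes, while the ID join does not; symbols can still be attached afterwards for display.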



James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 1 minute ago

United States

Hi Christian,

On 10/1/2010 6:10 AM, Christian Ruckert wrote:
> As the following example illustrates, each of the 40686 mapped probesets
> maps to exactly one gene symbol.

Yes, this was a design change of (maybe) two releases ago. The default is to only expose unambiguous mappings. This behavior can be modified using the toggleProbes() function.

> table(nhit(hgu95av2SYMBOL))

    0     1
  901 11724

> table(nhit(toggleProbes(hgu95av2SYMBOL, "all")))

    0     1     2     3     4     5     6     7
  493 11724   297    53    22     4    10     4
    8     9    10    11    12    14    20    21
    4     2     2     1     1     1     2     4
   22
    1

> table(nhit(toggleProbes(hgu95av2SYMBOL, "multiple")))

    0     2     3     4     5     6     7     8
12217   297    53    22     4    10     4     4
    9    10    11    12    14    20    21    22
    2     2     1     1     1     2     4     1

See ?toggleProbes for more information.

Best,

Jim
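Applied to the probeset from the original question, the toggleProbes() approach might look like the sketch below. It assumes hgu133plus2.db is installed; the helper name symbols_for_probe is made up for illustration and is not part of any package. No output is shown because the result depends on the annotation release.

```r
# Hypothetical helper (not part of any package): list every gene symbol
# attached to one probeset, including probesets that hit several genes.
# It expects hgu133plus2.db to be attached before it is called.
symbols_for_probe <- function(probe) {
  # toggleProbes(..., "all") exposes single- and multiple-gene probesets alike
  all_maps <- toggleProbes(hgu133plus2SYMBOL, "all")
  # Subset the map to the requested probeset and pull out its symbols
  as.list(all_maps[probe])[[1]]
}

# Only run the lookup if the annotation package is available.
if (requireNamespace("hgu133plus2.db", quietly = TRUE)) {
  suppressPackageStartupMessages(library("hgu133plus2.db"))
  print(symbols_for_probe("203074_at"))
}
```

More recent AnnotationDbi releases also offer select(), which returns one row per probeset and gene pair and so shows multiple hits without toggling; whether that is available depends on the installed version.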



Benjamin Otto ▴ 830

@benjamin-otto-1519

Last seen 9.6 years ago

Hi Christian,

that's interesting. I remember that I used to fumble around a bit when using these annotation packages because of the multiple mappings for some of the IDs. Maybe the way these multiple hits are treated has changed in recent versions of the db packages. However, here are a few points:

1) What I currently do is use a kind of hybrid annotation table, combining the annotation delivered by the hgu133plus2.db mapping (some older version) with additional manual annotation via biomart. Biomart certainly has the advantage that it should be more up to date than these packages, at least to a certain degree.

2) If you decide to use biomart (solely or in combination with something else) for your annotation: save your annotation table where you can find it later, so you can work with a consistent table throughout the project.

3) If you are just at the beginning of your project and are wondering how to treat not-so-unique IDs, or cases where several probesets target one gene: it might be worth having a look at the alternative mappings (providing cdf files) of Affyprobeminer or the Brainarray mappings from Michigan.

Hope that helps a little bit.

regards
Benjamin
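The advice in point 2 about keeping a fixed copy of the annotation table could be sketched in base R as follows. The table contents, the column names, and the file name are placeholders invented for the example; in practice the data frame would be whatever came back from the annotation package or biomart.

```r
# Placeholder annotation table (invented probes, IDs and symbols).
annot <- data.frame(
  probe_id  = c("probe_1", "probe_2"),
  entrez_id = c("1001", "1002"),
  symbol    = c("ABC1", "XYZ2"),
  stringsAsFactors = FALSE
)

# Record when the snapshot was taken, so the table can be traced later.
attr(annot, "retrieved_on") <- format(Sys.Date())

# Save to a file; tempdir() is used here only so the sketch runs anywhere.
# A real project would use a stable path inside the project folder.
annot_file <- file.path(tempdir(), "annotation_snapshot.rds")
saveRDS(annot, annot_file)

# Later sessions reload the saved table instead of querying again.
reloaded <- readRDS(annot_file)
identical(annot, reloaded)
```

Saving with saveRDS() keeps column types and the date attribute intact, which a plain text export would not necessarily do.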

