mis-matched gene symbols and entrez ID in biomaRt
Wendy Qiao ▴ 360
@wendy-qiao-4501
Hi all,

I am converting the HGNC symbols from an Illumina human array to Entrez IDs using biomaRt. I found that some gene symbols are matched to many Entrez IDs, and vice versa. I am wondering how to solve this problem so that each gene symbol is matched to only one Entrez ID. Or is there another package that I can use for matching gene symbols to Entrez IDs? Thank you in advance.

Wendy

=====
In the following example, BAGE2, BAGE3, BAGE4 and BAGE5 are each matched to both 85316 and 85317, which are the Entrez IDs of BAGE5 and BAGE4, respectively.

    library('biomaRt')
    ensembl=useMart("ensembl_mart_51",dataset="hsapiens_gene_ensembl",archive=TRUE)
    Entrez<-getBM(attributes=c("hgnc_symbol","entrezgene"),filters="hgnc_symbol",values=GeneList,mart=ensembl)
    # class(GeneList) = factor

    Entrez[1:20,]
       hgnc_symbol entrezgene
    1        ZFP62      92379
    2     C9orf169     375791
    3       FAM72D     653573
    4         HMX1         NA
    5         HMX1       3166
    6        ZFP62         NA
    7        RSPO4     343637
    8        DOC2B       8447
    9      C8orf42     157695
    10       TTTY8         NA
    11       A26C3         NA
    12       BAGE5      85316
    13       BAGE4      85316
    14       BAGE3      85316
    15       BAGE2      85316
    16       BAGE5      85317
    17       BAGE4      85317
    18       BAGE3      85317
    19       BAGE2      85317
    20        NBR1       4077
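Before choosing a fix, it can help to list exactly which symbols and which Entrez IDs are ambiguous. The following is a minimal base-R sketch, assuming the getBM() result is a data frame with the two columns shown in the question; rows 12 to 20 of the posted table are retyped by hand here, and the variable names are only illustrative.

```r
# Rows 12-20 of the table posted in the question, retyped by hand.
entrez <- data.frame(
  hgnc_symbol = c("BAGE5", "BAGE4", "BAGE3", "BAGE2",
                  "BAGE5", "BAGE4", "BAGE3", "BAGE2", "NBR1"),
  entrezgene  = c(85316, 85316, 85316, 85316,
                  85317, 85317, 85317, 85317, 4077),
  stringsAsFactors = FALSE
)

# Count distinct Entrez IDs per symbol, and distinct symbols per Entrez ID.
ids_per_symbol <- tapply(entrez$entrezgene, entrez$hgnc_symbol,
                         function(x) length(unique(x)))
symbols_per_id <- tapply(entrez$hgnc_symbol, entrez$entrezgene,
                         function(x) length(unique(x)))

# Anything with a count above 1 is part of a one-to-many relationship.
ambiguous_symbols <- names(ids_per_symbol)[ids_per_symbol > 1]
ambiguous_ids     <- names(symbols_per_id)[symbols_per_id > 1]

print(ambiguous_symbols)
print(ambiguous_ids)
```

On a full result the same two vectors give a short list of cases to check by hand, while the unambiguous rows can be used directly.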
@steve-lianoglou-2771
Hi,

On Tue, Sep 6, 2011 at 11:24 PM, Wendy Qiao <wendy2.qiao at gmail.com> wrote:
> In the following example, BAGE2, 3, 4 and 5 are matched to 85316 and 85317
> which are the Entrez IDs of BAGE5 and BAGE4, respectively.

Not sure why that's happening (out of curiosity, is ensembl_mart_51 an older version of the db? I hardly ever use biomaRt, it seems).

Anyway, it seems like using the org.Hs.eg.db package would be OK:

    R> library(org.Hs.eg.db)
    R> mget(paste("BAGE", 2:5, sep=""), revmap(org.Hs.egSYMBOL), ifnotfound=NA)
    $BAGE2
    [1] "85319"

    $BAGE3
    [1] "85318"

    $BAGE4
    [1] "85317"

    $BAGE5
    [1] "85316"

... and you get the added bonus of not having to fire your query "over the wire".

HTH,
-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
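The same org.Hs.eg.db lookup can also be written with the mapIds() interface from AnnotationDbi, which returns one value per input key. This is only a sketch: the helper name symbol_to_entrez is hypothetical, it assumes both Bioconductor packages are installed, and multiVals = "first" is just one possible policy for symbols that map to several IDs.

```r
# Hypothetical helper: map gene symbols to Entrez IDs, one value (or NA)
# per input symbol. Assumes org.Hs.eg.db and AnnotationDbi are installed.
symbol_to_entrez <- function(symbols) {
  if (!requireNamespace("org.Hs.eg.db", quietly = TRUE) ||
      !requireNamespace("AnnotationDbi", quietly = TRUE)) {
    stop("Please install the Bioconductor packages org.Hs.eg.db and AnnotationDbi")
  }
  AnnotationDbi::mapIds(
    org.Hs.eg.db::org.Hs.eg.db,
    keys      = as.character(symbols),  # GeneList in the question is a factor
    column    = "ENTREZID",
    keytype   = "SYMBOL",
    multiVals = "first"                 # keep the first ID when several match
  )
}

# Usage (requires the annotation package):
# symbol_to_entrez(paste0("BAGE", 2:5))
```

The as.character() call matters because the question mentions that GeneList is a factor. Taking the first match silently hides ambiguity, so multiVals = "list" may be preferable when the ambiguous cases need to be inspected.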
@iain-gallagher-2532
Hi Wendy

The version of Ensembl you're using is very old. The current database is at 63.

Identifiers (even Entrez / ENSEMBL IDs) come and go as knowledge of the genome changes. Some identifiers stay in the databases but get tagged as 'retired'. If the number of mismatches is low you can sort this out manually using the web-based Entrez Gene query system. If it's a large number then e.g. the org.Hs.eg.db package may help (although you may lose a few genes because of retired IDs).

    library(org.Hs.eg.db)
    symbols <- c('BAGE2', 'BAGE3', 'BAGE4', 'BAGE5')
    EGIDS <- unlist(mget(symbols, org.Hs.egSYMBOL2EG, ifnotfound = NA))

Another handy package is limma, which has the alias2SymbolTable function, so you can convert your list of symbols (which may contain a mixture of official symbols and 'alias' symbols) to official symbols only, e.g.

    library(limma)
    symbols <- c('BAGE2', 'BAGE3', 'BAGE4', 'BAGE5')
    symbolsOfficial <- alias2SymbolTable(symbols, species = "Hs")

Note that this example may just return the same symbols... I haven't run the code.

You might want to run this before using the org.Hs.eg.db package above to make sure all your symbols are official.

Best

iain
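To see how many genes are actually lost to missing or retired identifiers, the mapping result can be split into mapped pairs and symbols with no Entrez ID at all. A minimal base-R sketch follows, assuming the two-column layout from the question; a few rows of the posted table are retyped by hand and the variable names are illustrative.

```r
# A few rows from the table in the original question, retyped by hand.
entrez <- data.frame(
  hgnc_symbol = c("ZFP62", "HMX1", "HMX1", "ZFP62", "TTTY8", "A26C3", "NBR1"),
  entrezgene  = c(92379, NA, 3166, NA, NA, NA, 4077),
  stringsAsFactors = FALSE
)

# Keep only rows that carry an Entrez ID, and drop exact duplicate pairs.
mapped <- unique(entrez[!is.na(entrez$entrezgene), ])

# Symbols with no Entrez ID on any row: these are the genes that get lost.
unmapped <- setdiff(unique(entrez$hgnc_symbol), mapped$hgnc_symbol)

print(mapped)
print(unmapped)
```

Here HMX1 and ZFP62 each appear once with an ID and once with NA, so dropping the NA rows loses nothing; only the symbols left in unmapped need manual follow-up.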
Biomart should be the best choice among all ID converters, provided we choose the right database. It is also user friendly.

Vasu
Hi Wendy,

I prefer a faster way of converting: parsing text files. You can download the latest human gene_info file (Homo_sapiens.gene_info) from NCBI, which contains HGNC symbols and the corresponding Entrez GeneIDs, along with further information. The information is accurate and also quick to get.

Regards,
Srikanth

On Thu, Sep 8, 2011 at 11:22 AM, vasu punj <punjv@yahoo.com> wrote:
> Biomart should be the best choice among all ID converters provided we
> choose right datbase. It is also user friendly.

--
Srinivas M. Srikanth
Ph.D. Student
Institute of Bioinformatics
Discoverer, 7th Floor, International Technology Park,
Bangalore, India
Mob:+917259692031
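A rough sketch of the text-file approach in R is shown below. So that it runs on its own, it writes a tiny stand-in file instead of downloading anything; the real NCBI gene_info file is tab-delimited with many more columns, and the assumption here is that it has a header line starting with '#' and columns named GeneID and Symbol. The two rows reuse the BAGE4 and BAGE5 IDs quoted in the thread.

```r
# Build a tiny stand-in for a gene_info file (the real file has more columns).
tmp <- tempfile(fileext = ".gene_info")
writeLines(c(
  "#tax_id\tGeneID\tSymbol",
  "9606\t85317\tBAGE4",
  "9606\t85316\tBAGE5"
), tmp)

# comment.char = "" keeps the header line, which starts with '#';
# quote = "" avoids problems with quote characters in free-text fields.
gene_info <- read.delim(tmp, quote = "", comment.char = "",
                        stringsAsFactors = FALSE)

# Named lookup vector: symbol -> Entrez GeneID.
symbol2entrez <- setNames(gene_info$GeneID, gene_info$Symbol)

print(symbol2entrez[c("BAGE4", "BAGE5")])
```

With the real file, tmp would simply be replaced by the path to the downloaded (and decompressed) file. A symbol that is absent from the file comes back as NA from the lookup vector.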