mapping through org.Xx.eg.db packages

Iain Gallagher ▴ 930

@iain-gallagher-2532

Last seen 8.8 years ago

United Kingdom

Dear List

I wonder if someone could shed some light on the following.

Given a set of gene symbols I would like to retrieve different identifiers.

Using the org.Xx.eg.db packages I can go about this by mapping through the EntrezIDs:

    # mapping through eg ids as package is eg id centric
    library(org.Hs.eg.db)
    syms <- c('ACTB', 'TNF', 'TGFB1')
    egID <- unlist(mget(syms, org.Hs.egSYMBOL2EG, ifnotfound=NA))
    ensID <- unlist(mget(egID, org.Hs.egENSEMBL, ifnotfound=NA))

    > ensID
                   60             71241             71242             71243
    "ENSG00000075624" "ENSG00000204490" "ENSG00000206439" "ENSG00000223952"
                71244             71245             71246             71247
    "ENSG00000228321" "ENSG00000228849" "ENSG00000230108" "ENSG00000232810"
                 7040
    "ENSG00000105329"

    > egID
      ACTB    TNF  TGFB1
      "60" "7124" "7040"

Now here I assumed that the names of the ensID object were the original EntrezIDs mapped from the symbols, but because R does not handle duplicate names they are not: those EntrezIDs that have a plurality of matches are renumbered (here 7124 becomes 71241, 71242, etc.).

This has caused me some confusion, since each of these names is an actual Entrez ID, just not one I'm interested in.

The same can happen when mapping from any ID that ends in a numeric part (e.g. Ensembl IDs).

It is useful to return a mapping showing the original identifier, the EntrezID mapped through and the required identifier, so how could one reliably do this when mapping through e.g. Entrez IDs as in the method above (i.e. return the Entrez ID and Ensembl ID in one sweep)?

I have tried using the SQL approach:

    dbCon <- org.Hs.eg_dbconn()
    sqlQuery <- 'SELECT * FROM genes, gene_info, ensembl WHERE genes._id = gene_info._id = ensembl._id;'
    result <- dbGetQuery(dbCon, sqlQuery)

where one could filter the 'result' object with the symbols of interest, but this query takes a long time to run. I know little SQL so that might be an issue!

Best

iain

    > sessionInfo()
    R version 2.13.2 (2011-09-30)
    Platform: x86_64-pc-linux-gnu (64-bit)

    locale:
     [1] LC_CTYPE=en_GB.utf8       LC_NUMERIC=C
     [3] LC_TIME=en_GB.utf8        LC_COLLATE=en_GB.utf8
     [5] LC_MONETARY=C             LC_MESSAGES=en_GB.utf8
     [7] LC_PAPER=en_GB.utf8       LC_NAME=C
     [9] LC_ADDRESS=C              LC_TELEPHONE=C
    [11] LC_MEASUREMENT=en_GB.utf8 LC_IDENTIFICATION=C

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] org.Hs.eg.db_2.4.6   RSQLite_0.9-4        DBI_0.2-5
    [4] AnnotationDbi_1.14.1 Biobase_2.10.0
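On the SQL approach: in SQLite a chained comparison such as a = b = c is evaluated as (a = b) = c, so the WHERE clause in the query above does not restrict the three tables to rows with matching _id values. The comma-separated FROM list then has to be evaluated over the full cross product of the tables, which is probably why the query is slow. What follows is a minimal sketch of explicit joins filtered by symbol. It runs against a toy in-memory database, not the real package; the table names are taken from the query above, while the column names gene_id, symbol and ensembl_id and the toy rows are assumptions for illustration.

```r
library(DBI)
library(RSQLite)

# Toy in-memory stand-in for the annotation database (assumed schema).
con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(con, "genes", data.frame(
  `_id` = 1:3, gene_id = c("60", "7124", "7040"),
  check.names = FALSE, stringsAsFactors = FALSE))
dbWriteTable(con, "gene_info", data.frame(
  `_id` = 1:3, symbol = c("ACTB", "TNF", "TGFB1"),
  check.names = FALSE, stringsAsFactors = FALSE))
dbWriteTable(con, "ensembl", data.frame(
  `_id` = c(1L, 2L, 2L, 3L),
  ensembl_id = c("ENSG00000075624", "ENSG00000204490",
                 "ENSG00000206439", "ENSG00000105329"),
  check.names = FALSE, stringsAsFactors = FALSE))

syms <- c("ACTB", "TNF", "TGFB1")

# One "?" placeholder per symbol, so the filter runs inside the database.
placeholders <- paste(rep("?", length(syms)), collapse = ", ")

# Each join states its own equality condition on the shared _id key.
sql <- paste0(
  "SELECT gi.symbol, g.gene_id, e.ensembl_id ",
  "FROM gene_info AS gi ",
  "JOIN genes AS g ON g._id = gi._id ",
  "JOIN ensembl AS e ON e._id = gi._id ",
  "WHERE gi.symbol IN (", placeholders, ") ",
  "ORDER BY g.gene_id, e.ensembl_id"
)

result <- dbGetQuery(con, sql, params = as.list(syms))
dbDisconnect(con)

result
```

Against the real package, the same query text could be passed to dbGetQuery() with the connection returned by org.Hs.eg_dbconn(), provided the assumed column names match the installed schema.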


ADD COMMENT • link updated 12.6 years ago by Sean Davis 21k • written 12.6 years ago by Iain Gallagher ▴ 930

Sean Davis 21k

@sean-davis-490

Last seen 4 months ago

United States

On Thu, Oct 6, 2011 at 7:50 AM, Iain Gallagher <iaingallagher at btopenworld.com> wrote:
> [...]
> It is useful to return a mapping showing the original identifier, the EntrezID mapped through and the required identifier so how could one reliably do this when mapping through e.g. Entrez IDs as in the method above (i.e. return the Entrez ID and Ensembl ID in one sweep)?

Hi, Iain. Just leave out the "unlist" from your code.

    > ensIDList <- mget(egID, org.Hs.egENSEMBL, ifnotfound=NA)
    > ensIDList
    $`60`
    [1] "ENSG00000075624"

    $`7124`
    [1] "ENSG00000204490" "ENSG00000206439" "ENSG00000223952" "ENSG00000228321"
    [5] "ENSG00000228849" "ENSG00000230108" "ENSG00000232810"

    $`7040`
    [1] "ENSG00000105329"

Hope that helps.

Sean
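Leaving out unlist() keeps the Entrez IDs intact as list names. R vectors can in fact carry duplicate names; the renumbering comes from unlist() itself, which appends a position to the name of every list element longer than one. The base R sketch below reproduces that behaviour and then flattens the list into a three-column table without the generated names. The list here is a hand-written stand-in for the mget() result (shortened to three Ensembl IDs for 7124), and the column names are illustrative choices.

```r
# Hand-written stand-ins for the objects in the thread (no packages needed).
egID <- c(ACTB = "60", TNF = "7124", TGFB1 = "7040")
ensIDList <- list(
  "60"   = "ENSG00000075624",
  "7124" = c("ENSG00000204490", "ENSG00000206439", "ENSG00000223952"),
  "7040" = "ENSG00000105329"
)

# unlist() builds names by appending the position within each element,
# so "7124" turns into "71241", "71242", "71243".
mangled <- names(unlist(ensIDList))

# Number of Ensembl IDs mapped to each Entrez ID.
n <- vapply(ensIDList, length, integer(1))

# Repeat each Entrez ID once per mapped Ensembl ID and drop generated names.
mapping <- data.frame(
  entrez_id  = rep(names(ensIDList), times = n),
  ensembl_id = unlist(ensIDList, use.names = FALSE),
  stringsAsFactors = FALSE
)

# Recover the original symbol from the named egID vector.
mapping$symbol <- names(egID)[match(mapping$entrez_id, egID)]

mapping
```

Keys that were not found come back from mget(..., ifnotfound=NA) as a single NA, so they stay in the table as one row with an NA Ensembl ID.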

ADD COMMENT • link 12.6 years ago Sean Davis 21k

On 10/06/2011 04:58 AM, Sean Davis wrote:
> Hi, Iain. Just leave out the "unlist" from your code.
>
>> ensIDList <- mget(egID, org.Hs.egENSEMBL, ifnotfound=NA)
>> ensIDList
> $`60`
> [1] "ENSG00000075624"
>
> $`7124`
> [1] "ENSG00000204490" "ENSG00000206439" "ENSG00000223952" "ENSG00000228321"
> [5] "ENSG00000228849" "ENSG00000230108" "ENSG00000232810"
>
> $`7040`
> [1] "ENSG00000105329"
>
> Hope that helps.

I like to skin my cat (sorry, Zoro) as

    > egids = mappedLkeys(org.Hs.egALIAS2EG[syms])
    > merge(toTable(org.Hs.egALIAS2EG[syms]),
    +       toTable(org.Hs.egENSEMBL[egids]))
       gene_id alias_symbol      ensembl_id
    1       60         ACTB ENSG00000075624
    3     7040        TGFB1 ENSG00000105329
    4     7124          TNF ENSG00000204490
    5     7124          TNF ENSG00000206439
    6     7124          TNF ENSG00000223952
    7     7124          TNF ENSG00000228321
    8     7124          TNF ENSG00000228849
    9     7124          TNF ENSG00000228978
    10    7124          TNF ENSG00000230108
    11    7124          TNF ENSG00000232810

It can require intermediate checks that, e.g., syms %in% mappedRkeys(org.Hs.egALIAS2SYM).

The plan is that, in the next release, one could

    select(org.Hs.eg.db, egids, c("ENSEMBL", "SYMBOL"))

without having to worry about NA keys or the multiple maps.

Martin

--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793
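The select() interface mentioned above did arrive in later AnnotationDbi releases. A minimal sketch of how the symbol to Entrez ID to Ensembl ID table might be retrieved with it follows, assuming a current installation of org.Hs.eg.db. The argument names keys, columns and keytype follow the later released interface, not the call quoted above, and starting from the "SYMBOL" keytype is a choice made here so that no intermediate Entrez ID vector is needed.

```r
# Sketch only: needs the Bioconductor annotation package org.Hs.eg.db,
# so the call is skipped when the package is not installed.
syms <- c("ACTB", "TNF", "TGFB1")

if (requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  library(org.Hs.eg.db)  # attaches AnnotationDbi as a dependency

  # One row per symbol / Entrez ID / Ensembl ID combination; symbols with
  # several Ensembl IDs give several rows, unmapped keys give NA entries.
  # The namespace prefix avoids clashes with other packages' select().
  res <- AnnotationDbi::select(
    org.Hs.eg.db,
    keys    = syms,
    columns = c("ENTREZID", "ENSEMBL"),
    keytype = "SYMBOL"
  )
  print(res)
}
```

Because the result is an ordinary data frame, the original identifier, the Entrez ID and the Ensembl ID sit side by side and no names are generated along the way.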

ADD REPLY • link 12.6 years ago Martin Morgan 25k
