help with protein IPI annotation mappings

Kimpel, Mark W ▴ 890

@kimpel-mark-w-727

Last seen 11.5 years ago

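I need to map a list of rat International Protein Index accession IDs to EntrezGene. The proteins have been identified using mass spectroscopy and thus do not necessarily correspond to any particular affy chipset. How would I do this in BioC? Can biomaRt handle this?

Thanks,
Mark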

ADD COMMENT • link updated 19.1 years ago by Steffen Durinck ▴ 580 • written 19.1 years ago by Kimpel, Mark W ▴ 890

Steffen Durinck ▴ 580

@steffen-durinck-1780

Last seen 11.5 years ago

Hi Mark,

I don't have an IPI ID to test this, but biomaRt should be able to do what you want. Here's how:

    library(biomaRt)
    ipiID = c("your","list","of","IPI","IDs")
    ensembl = useMart("ensembl", dataset="rnorvegicus_gene_ensembl")
    getBM(attributes=c("ipi","entrezgene"), filters="ipi", values=ipiID, mart=ensembl)

You could probably make the result of the query a little cleaner by:

    getBM(attributes=c("ipi","entrezgene"), filters=c("ipi","with_ipi"), values=list(ids=ipiID,""), mart=ensembl)

Best,
Steffen

Kimpel, Mark William wrote:
> I need to map a list of rat International Protein Index accession ids to
> EntrezGene. The proteins have been identified using mass spectroscopy
> and thus do not necessarily correspond to any particular affy chipset.
> How would I do this in BioC? Can biomaRt handle this?
>
> Thanks,
> Mark
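A minimal sketch of tidying the table that such a query returns, assuming the result is a data frame with one column per requested attribute (`ipi` and `entrezgene`) and `NA` where no gene was found. The helper name `collapse_mapping` and the toy IDs are hypothetical, and no Ensembl connection is made here.

```r
# Hypothetical helper: turn a two-column mapping table (shaped like a
# getBM() result for attributes "ipi" and "entrezgene") into a named list
# with one entry per IPI ID. The column names are an assumption.
collapse_mapping <- function(res, from = "ipi", to = "entrezgene") {
  # Drop rows without a gene and remove duplicate pairs.
  res <- unique(res[!is.na(res[[to]]), c(from, to)])
  # One character vector of gene IDs per source ID.
  split(as.character(res[[to]]), res[[from]])
}

# Toy data standing in for a real query result (made-up IDs).
toy <- data.frame(
  ipi        = c("IPI_A", "IPI_A", "IPI_B", "IPI_C"),
  entrezgene = c(101, 102, 103, NA),
  stringsAsFactors = FALSE
)
mapping <- collapse_mapping(toy)
```

Keeping the result as a list makes one-to-many mappings explicit, which is common when a protein ID maps to more than one gene.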

ADD COMMENT • link 19.1 years ago Steffen Durinck ▴ 580

Steffen,

Your code to convert IPI to EntrezGene IDs worked like a charm. Now I have run into another problem. I have discovered that some of the IDs I need to map are GenBank IDs of the form (GI:XXXX). I have used listAttributes(ensembl) and cannot figure out which attribute, if any, corresponds to the NCBI GI. A previous post on this list indicated that this should be possible, but I must be missing something.

Thanks,
Mark

ADD REPLY • link 19.1 years ago Kimpel, Mark W ▴ 890

Hi Mark,

I quickly scanned the attributes and filters, and it looks like you currently cannot use GenBank accession numbers with Ensembl. To be sure, you could ask the Ensembl helpdesk (helpdesk at ensembl.org) whether GenBank accession numbers are in their database and what the name of the corresponding filter is. If they don't have GenBank IDs, you could ask whether they could be included in future releases.

Whatever information Ensembl makes available is retrievable through the biomaRt package, and questions or suggestions about the data present in Ensembl are best addressed to their helpdesk. Make sure you let them know you are using the BioMart version of Ensembl.

Cheers,
Steffen
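Scanning the attribute and filter tables can be sketched as a pattern search. This assumes the tables returned by listAttributes() and listFilters() are data frames with `name` and `description` columns; the helper `find_fields` and the toy table below are hypothetical and do not reflect what any Ensembl release actually contains.

```r
# Hypothetical helper: return the rows of an attribute or filter table whose
# name or description matches a pattern (case-insensitive).
find_fields <- function(tbl, pattern) {
  hit <- grepl(pattern, tbl$name, ignore.case = TRUE) |
    grepl(pattern, tbl$description, ignore.case = TRUE)
  tbl[hit, , drop = FALSE]
}

# Made-up table with the assumed column layout.
toy_attrs <- data.frame(
  name        = c("ipi", "entrezgene", "some_xref"),
  description = c("IPI ID", "EntrezGene ID", "Example GenBank-like cross-reference"),
  stringsAsFactors = FALSE
)
hits <- find_fields(toy_attrs, "genbank")

# With a live mart object this might be called as:
#   find_fields(listAttributes(ensembl), "genbank")
```

An empty result is consistent with the field not being exposed, though the helpdesk remains the authoritative source.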

ADD REPLY • link 19.1 years ago Steffen Durinck ▴ 580

Mapping GI numbers to Entrez Gene can be done pretty easily with eutils from NCBI. If you have a GI number (without the 'GI:'), such as 47078294 (which corresponds to RefSeq NM_000022, just for example), you can use eLink to get the reference to the Entrez Gene database by doing:

    readLines(url('http://www.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=nucleotide&db=gene&id=47078294'))

This will return XML, and the <Id>100</Id> tag in the linked set is the Gene ID of that GI number. I show just the readLines call here, but you could also use the XML package to parse the output if you like. If you loop over your GI numbers, you can retrieve them all. Be sure to leave a little time between queries so that you don't set off any alarms at NCBI about too many queries in too little time.

Hope that helps.
Sean
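The loop over GI numbers could be sketched as follows. The base URL is the one given in the post and may no longer be the preferred endpoint; the function names, the one-second pause, and the hand-written sample XML (an approximation of the response shape, not real output) are all assumptions.

```r
# Base URL as given in the post (assumption: it still resolves).
elink_base <- "http://www.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"

# Build the eLink URL for one GI number, stripping an optional "GI:" prefix.
elink_url <- function(gi, base = elink_base) {
  gi <- sub("^GI:", "", gi)
  paste0(base, "?dbfrom=nucleotide&db=gene&id=", gi)
}

# Pull the linked IDs out of the <LinkSetDb> part of the XML, so the
# query GI echoed in <IdList> is not mistaken for a gene ID.
extract_gene_ids <- function(xml_lines) {
  xml <- paste(xml_lines, collapse = "")
  block <- regmatches(xml, regexpr("<LinkSetDb>.*</LinkSetDb>", xml))
  if (length(block) == 0) return(character(0))
  ids <- regmatches(block, gregexpr("<Id>[0-9]+</Id>", block))[[1]]
  gsub("</?Id>", "", ids)
}

# Query each GI in turn, pausing between requests to stay polite to NCBI.
gi_to_gene <- function(gis, pause = 1) {
  out <- vector("list", length(gis))
  names(out) <- gis
  for (i in seq_along(gis)) {
    out[[i]] <- extract_gene_ids(readLines(elink_url(gis[i]), warn = FALSE))
    Sys.sleep(pause)
  }
  out
}

# Offline check of the parser on hand-written XML.
sample_xml <- c(
  "<eLinkResult><LinkSet><DbFrom>nuccore</DbFrom>",
  "<IdList><Id>47078294</Id></IdList>",
  "<LinkSetDb><DbTo>gene</DbTo><Link><Id>100</Id></Link></LinkSetDb>",
  "</LinkSet></eLinkResult>"
)
ids <- extract_gene_ids(sample_xml)
```

A regular expression is used only to keep the sketch dependency-free; the XML package mentioned in the post would be the more robust choice for real responses.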

ADD REPLY • link 19.1 years ago Sean Davis 21k

Sorry I've come in a bit late on this topic. eLink is a nice choice; you can also get the tab-delimited flat file of the IPI cross-reference database at: ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/

The columns are:

1# Database from which master entry of this IPI entry has been taken. One of either SP (UniProtKB/Swiss-Prot), TR (UniProtKB/TrEMBL), ENSEMBL (Ensembl), ENSEMBL_HAVANA (Ensembl Havana subset), REFSEQ_STATUS (where STATUS corresponds to the RefSeq entry revision status), VEGA (Vega), TAIR (TAIR Protein data set) or HINV (H-Invitational Database).
2# UniProtKB accession number or Vega ID or Ensembl ID or RefSeq ID or TAIR Protein ID or H-InvDB ID.
3# International Protein Index identifier.
4# Supplementary UniProtKB/Swiss-Prot entries associated with this IPI entry.
5# Supplementary UniProtKB/TrEMBL entries associated with this IPI entry.
6# Supplementary Ensembl entries associated with this IPI entry. Havana curated transcripts preceded by the key HAVANA: (e.g. HAVANA:ENSP00000237305;ENSP00000356824;).
7# Supplementary list of RefSeq STATUS:ID couples (separated by a semi-colon ';') associated with this IPI entry (RefSeq entry revision status details).
8# Supplementary TAIR Protein entries associated with this IPI entry.
9# Supplementary H-Inv Protein entries associated with this IPI entry.
10# Protein identifiers (cross reference to EMBL/Genbank/DDBJ nucleotide databases).
11# List of HGNC number, HGNC official gene symbol couples (separated by a semi-colon ';') associated with this IPI entry.
12# List of NCBI Entrez Gene gene number, Entrez Gene Default Gene Symbol couples (separated by a semi-colon ';') associated with this IPI entry.
13# UNIPARC identifier associated with the sequence of this IPI entry.
14# UniGene identifiers associated with this IPI entry.
15# CCDS identifiers associated with this IPI entry.
16# RefSeq GI protein identifiers associated with this IPI entry.
17# Supplementary Vega entries associated with this IPI entry.
...

See http://www.ebi.ac.uk/IPI/xrefs.html

Columns 3 and 7 would probably suit you and would be easy to read into R. Actually, you should probably choose columns 3 and 7 only where column 1 is REFSEQ_*. (Note that you can also get the MySQL dump of this database, which is even better if you know some SQL.) There might be only a few missing (no REFSEQ) that you can get with eLink as Sean suggests.

Cheers
Paul Leo
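Reading the flat file and keeping columns 3 and 7 for REFSEQ_* rows might look like the sketch below. It assumes the file is tab-delimited with no header line and the column order listed in the post; the function name `read_ipi_refseq`, the temporary file, and the toy rows (made-up IDs, truncated to seven columns) are hypothetical.

```r
# Hypothetical reader: keep the chosen columns for rows whose master
# database (column 1) starts with "REFSEQ".
read_ipi_refseq <- function(file, cols = c(3, 7), header = FALSE) {
  x <- read.delim(file, header = header, quote = "", comment.char = "",
                  stringsAsFactors = FALSE)
  x[grepl("^REFSEQ", x[[1]]), cols, drop = FALSE]
}

# Toy file with two made-up rows in the assumed layout.
toy_lines <- c(
  "SP\tP00001\tIPI_A\t\t\t\tREVIEWED:NP_000001;",
  "REFSEQ_REVIEWED\tNP_000002\tIPI_B\t\t\t\tVALIDATED:NP_000003;"
)
toy_file <- tempfile(fileext = ".txt")
writeLines(toy_lines, toy_file)

refseq_rows <- read_ipi_refseq(toy_file)
```

The `cols` argument is left adjustable because other columns in the list, such as 12 (Entrez Gene) or 16 (RefSeq GI), may be the ones needed for a given mapping.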

ADD REPLY • link 19.1 years ago Paul Leo ▴ 10

Sean and Paul,

Thanks for your help, it will work.

Mark

ADD REPLY • link 19.1 years ago Kimpel, Mark W ▴ 890
