from RefSeq to GO terms / gene symbol to geneID

Simon Lin ▴ 270

@simon-lin-1272

Last seen 9.6 years ago

In the following two unrelated messages, both Sean and Nianhua suggested downloading and parsing some data tables from the NCBI. The gene_info table and several others seem very useful. If that is the case, why not preload them into an SQLite database and distribute it as part of the annotation package for human?

Simon

=================
Date: Tue, 12 Jun 2007 05:59:55 -0400
From: Sean Davis <sdavis2 at mail.nih.gov>
Subject: Re: [BioC] from RefSeq GI protein identifiers to GO terms
To: Lina Hultin-Rosenberg <lina.hultin-rosenberg at ki.se>
Cc: bioconductor at stat.math.ethz.ch
Message-ID: <466E6E9B.3020609 at mail.nih.gov>
Content-Type: text/plain; charset=ISO-8859-1

Lina Hultin-Rosenberg wrote:
> Dear list,
>
> This might be a question that has been discussed previously, but I could not
> find any good solution for it. I have lists of human proteins from various
> proteomics studies that I want to compare with regard to the GO terms
> associated with them. I have the RefSeq GI protein ID for the proteins, and my
> question is how best to map those to other identifiers that I can use in
> subsequent GO analysis.
>
> It might be that this problem is best solved outside R, but maybe someone
> can still give me a hint about the best solution. For me this is a problem that
> comes up quite often (the need to map between different identifiers), and I
> have not yet found any really good solution to it. If I use IPI, for example, I
> always lose some proteins/genes since the coverage is rather bad, but maybe
> there is no solution that will give a perfect mapping?!

The file located here:

ftp://ftp.ncbi.nih.gov/gene/DATA/gene2refseq.gz

and described in detail here:

ftp://ftp.ncbi.nih.gov/gene/DATA/README

maps RefSeq to Entrez Gene ID. Once you have the Entrez Gene ID, you can use the Bioconductor annotation packages to get GO mappings. The file above is a tab-delimited text file, so you should be able to read it into R and do the matching by GI number rather easily.

Hope that helps.

Sean

========================
Message: 4
Date: Mon, 11 Jun 2007 12:36:31 +0000 (UTC)
From: Nianhua Li <nialicn@yahoo.com>
Subject: Re: [BioC] getting Locus Link ids from gene symbol
To: bioconductor at stat.math.ethz.ch
Message-ID: <loom.20070611t142932-100 at post.gmane.org>
Content-Type: text/plain; charset=us-ascii

Hi, Alex,

You can parse ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz
There are 4 useful columns: tax_id (column 1), GeneID (column 2), Symbol (column 3), and Synonyms (column 5). You can:

1 Read in the file
2 Filter it based on tax_id
3 Match your gene symbols to the "Symbol" column and find their Gene IDs
4 Remove the matched gene symbols from your list
5 Match the rest of the gene symbols to the "Synonyms" column and find their Gene IDs

Hope this helps

nianhua

Nianhua Li
Software Developer
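Nianhua's five steps can be sketched roughly as follows. The thread suggests doing this in R; this is only a minimal Python illustration of the same logic, not code from any of the posters. The function name `symbols_to_gene_ids` and the toy rows are hypothetical, and the sketch assumes that the Synonyms field is "|"-separated with "-" marking an empty value, which should be checked against the README before relying on it. The column positions are the ones given in the message above.

```python
import csv
import io

HUMAN_TAX_ID = "9606"  # NCBI Taxonomy ID for Homo sapiens


def symbols_to_gene_ids(rows, symbols, tax_id=HUMAN_TAX_ID):
    """Map symbols to GeneIDs: "Symbol" column first, "Synonyms" as fallback."""
    by_symbol, by_synonym = {}, {}
    for row in rows:
        # Step 2: keep rows for one organism (this also skips a "#..." header row).
        if len(row) < 5 or row[0] != tax_id:
            continue
        gene_id, symbol, synonyms = row[1], row[2], row[4]
        by_symbol.setdefault(symbol, set()).add(gene_id)
        if synonyms != "-":  # assumption: "-" marks an empty field
            for alias in synonyms.split("|"):  # assumption: "|"-separated aliases
                by_synonym.setdefault(alias, set()).add(gene_id)

    hits = {}
    for name in symbols:
        # Steps 3-5: official symbols win; only unmatched names try the synonyms.
        found = by_symbol.get(name) or by_synonym.get(name) or set()
        hits[name] = sorted(found)
    return hits


# Hypothetical toy rows in the gene_info column order; not real NCBI records.
SAMPLE = (
    "#tax_id\tGeneID\tSymbol\tLocusTag\tSynonyms\n"
    "9606\t101\tGENEA\t-\tALIAS1|ALIAS2\n"
    "9606\t102\tGENEB\t-\t-\n"
    "10090\t201\tGENEA\t-\tALIAS3\n"
)

# Step 1: read the tab-delimited text.
rows = csv.reader(io.StringIO(SAMPLE), delimiter="\t", quoting=csv.QUOTE_NONE)
hits = symbols_to_gene_ids(rows, ["GENEA", "ALIAS2", "NOPE"])
print(hits)
```

For the real file, the rows could instead come from a locally downloaded copy opened with `gzip.open(path, "rt")`. Each name maps to a list because a synonym may, in principle, point to more than one GeneID; an empty list means the name matched neither column.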

Proteomics Coverage Annotation GO • 3.3k views

16.9 years ago • updated 16.8 years ago Simon Lin ▴ 270

Lina Hultin-Rosenberg ▴ 80

@lina-hultin-rosenberg-2207

Last seen 9.6 years ago

Thank you so much for your help, I will try those alternatives out.

Best,
Lina

16.9 years ago Lina Hultin-Rosenberg ▴ 80

Lina Hultin-Rosenberg ▴ 80

@lina-hultin-rosenberg-2207

Last seen 9.6 years ago

Dear Simon and Sean,

Sorry to get back to this issue so late, but I have tried out various options to solve it. I parsed the files you mentioned but did not get many hits, since many of my proteins do not have an Entrez Gene ID for some reason. In my search I also tried some of the Entrez e-utils (http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html) and could get the accession numbers for my proteins. Can I go from accession number to GO term using biomaRt, for example?

Thanks again!

Best,
Lina Rosenberg
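When a lookup against gene2refseq returns few hits, it can help to list exactly which GI numbers found no match before switching tools. The following is a minimal Python sketch of Sean's matching-by-GI suggestion with that report added; it is an illustration, not the posters' code. The function name `map_protein_gis` and the toy rows are hypothetical, and the default column positions (GeneID in column 2, protein GI in column 7) are assumptions that should be verified against the README for the file version in use.

```python
import csv
import io


def map_protein_gis(rows, protein_gis, gi_col=6, gene_id_col=1):
    """Return ({gi: set of GeneIDs}, sorted list of GIs with no match).

    gi_col and gene_id_col are 0-based column positions (assumed defaults).
    """
    wanted = set(protein_gis)
    mapping = {}
    for row in rows:
        if len(row) <= max(gi_col, gene_id_col):
            continue  # skip short or malformed lines
        gi = row[gi_col]
        if gi in wanted:
            mapping.setdefault(gi, set()).add(row[gene_id_col])
    unmatched = sorted(wanted.difference(mapping))
    return mapping, unmatched


# Hypothetical toy rows in a gene2refseq-like layout; not real NCBI records.
SAMPLE = (
    "#tax_id\tGeneID\tstatus\tRNA_acc\tRNA_gi\tprotein_acc\tprotein_gi\n"
    "9606\t101\tREVIEWED\tNM_X1\t5001\tNP_X1\t7001\n"
    "9606\t102\tREVIEWED\tNM_X2\t5002\tNP_X2\t7002\n"
)

rows = csv.reader(io.StringIO(SAMPLE), delimiter="\t", quoting=csv.QUOTE_NONE)
mapping, unmatched = map_protein_gis(rows, ["7001", "7999"])
print(mapping)
print(unmatched)
```

The identifiers are compared as strings, so GI numbers read from a spreadsheet or another tool would need to be converted to plain digit strings first. The unmatched list can then be inspected by hand or passed to another resource.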

16.8 years ago Lina Hultin-Rosenberg ▴ 80

Simon Lin ▴ 270

@simon-lin-1272

Last seen 9.6 years ago

If you do not have a large number of sequences, BioMart is a good choice.

-Simon

----- Original Message -----
From: "Lina Hultin-Rosenberg" <lina.hultin-rosenberg@ki.se>
To: "Simon Lin" <simonlin at duke.edu>
Cc: <sdavis2 at mail.nih.gov>; <bioconductor at stat.math.ethz.ch>
Sent: Friday, June 29, 2007 12:54 AM
Subject: Re: [BioC] from RefSeq to GO terms / gene symbol to geneID

16.8 years ago Simon Lin ▴ 270
