Simon Lin
In the following two unrelated messages, both Sean and Nianhua suggest
downloading and parsing some data tables from the NCBI. The gene_info and
several other tables seem very useful. If that is the case, why not
have them pre-loaded into SQLite and distributed as part of the
annotation package for human?

Simon

=================
Date: Tue, 12 2007 05:59:55 -0400
From: Sean Davis <sdavis2 at mail.nih.gov>
Subject: Re: [BioC] from RefSeq GI protein identifiers to GO terms
To: Lina Hultin-Rosenberg <lina.hultin-rosenberg at ki.se>
Cc: bioconductor at stat.math.ethz.ch
Message-ID: <466E6E9B.3020609 at mail.nih.gov>
Content-Type: text/plain; charset=ISO-8859-1

Lina Hultin-Rosenberg wrote:
>> Dear list,
>> This might be a question that has been discussed previously but I could not
>> find any good solution for it. I have lists of human proteins from
>> proteomics studies that I want to compare with regard to the GO terms
>> associated with them. I have the RefSeq GI protein id for the proteins and my
>> question is how I best map those to other identifiers that I can use in
>> subsequent GO analysis?
>> It might be that this problem is solved best outside R but maybe you
>> still can give me a hint to the best solution. For me this is a problem that
>> comes up quite often - the need to map between different identifiers - and I
>> have not yet found any really good solution to it. If I for example use IPI I
>> always lose some proteins/genes since the coverage is rather bad, but maybe
>> there is no solution that will give perfect mapping?!
The file located here:
and described in detail here:
maps RefSeq to Entrez Gene ID. Once you have the Entrez Gene ID, you
can use the Bioconductor annotation packages to get GO mappings. The
file above is a tab-delimited text file, so you should be able to read
it into R and do the matching by GI number rather easily.
Hope that helps.
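
The matching step described above can be sketched roughly as follows. This is only an illustration in Python (in R the same idea would be a data frame read from the file plus a lookup on the GI column). The function name, the sample rows and the column positions are all assumptions made for the example; the real file's layout should be checked against its own description before use.

```python
import csv
import io


def map_gi_to_gene_id(lines, gi_list, gi_col, gene_col):
    """Return {gi: set of Gene IDs} for the GI numbers found in a tab-delimited table.

    gi_col and gene_col are zero-based column positions supplied by the caller;
    they are assumptions here, not taken from the real file.
    """
    wanted = set(gi_list)
    mapping = {}
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines, comment/header lines, and rows that are too short.
        if not row or row[0].startswith("#") or len(row) <= max(gi_col, gene_col):
            continue
        gi = row[gi_col]
        if gi in wanted:
            # One GI may appear on several rows, so collect Gene IDs in a set.
            mapping.setdefault(gi, set()).add(row[gene_col])
    return mapping


# Made-up rows for illustration only; these are not real NCBI records.
SAMPLE = (
    "#tax_id\tGeneID\tprotein_gi\n"
    "9606\t1001\t111\n"
    "9606\t1002\t222\n"
    "9606\t1002\t333\n"
)

query = ["111", "333", "999"]
result = map_gi_to_gene_id(io.StringIO(SAMPLE), query, gi_col=2, gene_col=1)
# GI numbers with no match are worth reporting, since coverage is rarely perfect.
unmatched = [gi for gi in query if gi not in result]
```

For a real download, `lines` would be an open text file handle instead of the `io.StringIO` sample.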
Message: 4
Date: Mon, 11 Jun 2007 12:36:31 +0000 (UTC)
From: Nianhua Li <nialicn@yahoo.com>
Subject: Re: [BioC] getting Locus Link ids from gene symbol
To: bioconductor at stat.math.ethz.ch
Message-ID: <loom.20070611t142932-100 at post.gmane.org>
Content-Type: text/plain; charset=us-ascii
Hi, Alex,
You can parse ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz
There are 4 useful columns: tax_id (column 1), GeneID (column 2), Symbol
(column 3), and Synonyms (column 5). You can:
1. Read in the file
2. Filter it based on tax_id
3. Match your gene symbols to the "Symbol" column and find their Gene IDs
4. Remove the matched gene symbols from your list
5. Match the rest of the gene symbols to the "Synonyms" column and find
their Gene IDs
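
The five steps above could look roughly like this. It is a sketch in Python rather than R, run on made-up rows instead of the real gene_info file. The function name is hypothetical, and the code assumes that multiple synonyms in one field are separated by "|"; that should be confirmed against the file itself.

```python
import csv
import io


def symbols_to_gene_ids(lines, symbols, tax_id):
    """Map gene symbols to Gene IDs, trying Symbol first and Synonyms second."""
    remaining = set(symbols)

    # Steps 1-2: read the tab-delimited table and keep rows for one tax_id (column 1).
    rows = []
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) >= 5 and row[0] == tax_id:
            rows.append(row)

    # Step 3: match against the Symbol column (column 3), collecting GeneID (column 2).
    found = {}
    for row in rows:
        if row[2] in remaining:
            found.setdefault(row[2], set()).add(row[1])

    # Step 4: drop the symbols that already matched.
    remaining -= set(found)

    # Step 5: match what is left against the Synonyms column (column 5).
    # Assumption: synonyms within one field are separated by "|".
    for row in rows:
        for synonym in row[4].split("|"):
            if synonym in remaining:
                found.setdefault(synonym, set()).add(row[1])
    return found


# Made-up rows for illustration only; these are not real gene_info records.
SAMPLE = (
    "#tax_id\tGeneID\tSymbol\tLocusTag\tSynonyms\n"
    "9606\t101\tABC1\t-\tOLD1|OLD2\n"
    "9606\t102\tXYZ2\t-\t-\n"
    "10090\t201\tABC1\t-\tOLD3\n"
)

hits = symbols_to_gene_ids(io.StringIO(SAMPLE), ["ABC1", "OLD2", "OLD3", "NOPE"], "9606")
```

Values are kept as sets because a synonym can point to more than one gene, in which case the caller has to decide how to resolve the ambiguity.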
hope this helps
Nianhua Li
Software Developer
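
As a rough illustration of the pre-loading idea raised at the top of this thread, the following Python sketch puts a few gene_info-style rows into an SQLite table and queries it by symbol. The table name, column names, index and helper functions are all hypothetical, and the rows are made up; this is not how any existing annotation package is built.

```python
import sqlite3


def build_gene_info_db(rows):
    """Create an in-memory SQLite database holding (tax_id, gene_id, symbol) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE gene_info (tax_id TEXT, gene_id TEXT, symbol TEXT)")
    # An index on (tax_id, symbol) keeps symbol lookups fast on a large table.
    conn.execute("CREATE INDEX idx_gene_info_symbol ON gene_info (tax_id, symbol)")
    conn.executemany("INSERT INTO gene_info VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def lookup_gene_ids(conn, tax_id, symbol):
    """Return the Gene IDs recorded for one symbol in one organism."""
    cursor = conn.execute(
        "SELECT gene_id FROM gene_info WHERE tax_id = ? AND symbol = ? ORDER BY gene_id",
        (tax_id, symbol),
    )
    return [record[0] for record in cursor]


# Made-up rows for illustration only; these are not real gene_info records.
conn = build_gene_info_db(
    [
        ("9606", "101", "ABC1"),
        ("9606", "102", "XYZ2"),
        ("10090", "201", "ABC1"),
    ]
)
human_hits = lookup_gene_ids(conn, "9606", "ABC1")
missing = lookup_gene_ids(conn, "9606", "NOPE")
```

A distributed package would write the database to a file rather than `:memory:`, so users would query it directly instead of downloading and parsing the tables themselves.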