Dick Beyer
Hi to all,
For several years now, I have been doing GO analysis on lists of
proteins derived from MS. I am given IPIs by the proteomics folks and
need the corresponding Entrez Gene IDs. Putting aside the issues of
non-unique mapping from IPI to EG, isoforms, etc., I was wondering if
anyone would comment on my method of getting the Entrez Gene IDs. I'd
really like to use Marc Carlson's merge method (shown below), but that
approach seems to miss several thousand IPI/EG matches that my method
finds.
I start with
ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.HUMAN.dat.gz, and
extract a subset of the rows:
ipiHUMAN <- readLines(con = "ipi.HUMAN.3.80.dat", n = -1) # build 11feb2011
dbfetch.all <- ipiHUMAN
rm(ipiHUMAN)
# Explanation of the data format is found here
# http://www.ebi.ac.uk/2can/tutorials/formats.html#swiss
length(dbfetch.all) # 3180244
length(eg <- grep("^DR Entrez Gene", dbfetch.all)) # 80296
length(ids <- grep("^ID", dbfetch.all)) # 86719
length(de <- grep("^DE", dbfetch.all)) # 92454
length(ac <- grep("^AC", dbfetch.all)) # 93720
length(ug <- grep("^DR UniGene", dbfetch.all)) # 88314
length(up <- grep("^DR UniProtKB", dbfetch.all)) # 110593
length(en <- grep("^DR ENSEMBL", dbfetch.all)) # 77340
length(rs <- grep("^DR REFSEQ_REVIEWED",dbfetch.all)) # 14559
and eventually turn this into a data.frame with the columns:
"IPI","EG","GeneSymbol","UniGene","UniProtKB","ENSEMBL","REFSEQ_REVIEWED"
(Note: Not every IPI entry has every field)
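To make the extraction step concrete, here is a minimal sketch of how ID lines and Entrez Gene cross-references might be paired up record by record. The toy lines, the exact field layout (two-letter line code, semicolon-separated DR fields, a version suffix on the accession) and all object names are assumptions for illustration, not the real contents of the IPI file and not my full script.

```r
# Toy records imitating the assumed flat-file layout (all values are made up).
toy <- c(
  "ID   IPI00000001.2   IPI;   PRT;   549 AA.",
  "DR   Entrez Gene; 1111; GENEA; -.",
  "DR   UniGene; Hs.0001; -.",
  "//",
  "ID   IPI00000002.1   IPI;   PRT;   100 AA.",
  "DR   UniGene; Hs.0002; -.",
  "//",
  "ID   IPI00000003.1   IPI;   PRT;   200 AA.",
  "DR   Entrez Gene; 2222; GENEB; -.",
  "DR   Entrez Gene; 3333; GENEC; -.",
  "//"
)

# Assign every line to a record: the counter increases at each ID line.
is.id <- grepl("^ID\\s", toy)
rec <- cumsum(is.id)

# One IPI accession per record, with the (assumed) version suffix dropped.
ipi <- sub("\\.\\d+$", "", sub("^ID\\s+(\\S+).*$", "\\1", toy[is.id]))

# Entrez Gene cross-references, keyed by the record they belong to.
is.eg <- grepl("^DR\\s+Entrez Gene;", toy)
eg.fields <- strsplit(sub("^DR\\s+Entrez Gene;\\s*", "", toy[is.eg]), ";\\s*")
eg <- data.frame(IPI = ipi[rec[is.eg]],
                 EG = vapply(eg.fields, `[`, character(1), 1),
                 GeneSymbol = vapply(eg.fields, `[`, character(1), 2),
                 stringsAsFactors = FALSE)

# Left join so that IPIs without an Entrez Gene ID are kept, with NA.
ipi2eg <- merge(data.frame(IPI = ipi, stringsAsFactors = FALSE), eg,
                by = "IPI", all.x = TRUE)
print(ipi2eg)
```

An IPI with several Entrez Gene lines yields several rows, which is one reason the data.frame can have more rows than unique IPIs.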
For this build of the IPI file, my data.frame ends up as
dim(dat.all)
[1] 183153 7
Of these 183153 IPIs, there are 171909 unique IPIs, and 22342 unique
Entrez Gene IDs.
The merge method shown below from Marc Carlson gives 69315 unique IPIs
and 17783 unique Entrez Gene IDs (you get the same numbers whether you
use org.Hs.egGO2ALLEGS or org.Hs.egGO).
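For anyone who wants to reproduce this kind of comparison, the counts and the IPI/EG pairs found by only one of the two approaches can be computed roughly as follows. This is a sketch on toy data: `mine` and `pkg` are hypothetical stand-ins for my data.frame and for the merged annotation table, and the IDs are made up.

```r
# Toy stand-ins for the two mapping tables (all IDs are made up).
mine <- data.frame(IPI = c("IPI1", "IPI2", "IPI3", "IPI3", "IPI4"),
                   EG  = c("11", "22", "33", "34", NA),
                   stringsAsFactors = FALSE)
pkg <- data.frame(IPI = c("IPI1", "IPI3"),
                  EG  = c("11", "33"),
                  stringsAsFactors = FALSE)

# Unique IPIs and unique non-missing Entrez Gene IDs in one table.
count.unique <- function(d) {
  c(IPI = length(unique(d$IPI)), EG = length(unique(na.omit(d$EG))))
}
counts <- rbind(mine = count.unique(mine), pkg = count.unique(pkg))

# Represent each mapped pair as a single "IPI|EG" key, skipping missing EGs.
pair <- function(d) {
  d <- d[!is.na(d$EG), ]
  paste(d$IPI, d$EG, sep = "|")
}

# Pairs present in one table but not in the other.
only.mine <- setdiff(pair(mine), pair(pkg))
only.pkg  <- setdiff(pair(pkg), pair(mine))

print(counts)
print(only.mine)
```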
When I build my 7 column data.frame, I initially get 22305 unique
Entrez Gene IDs, and I then go through some additional steps of trying
to fill in the missing EGs. I do this by taking the IPIs with no EGs,
and using biomaRt with UniGene, UniProtKB etc as inputs to getBM(),
and hope I get a few more EGs.
For example:
library(biomaRt)
mart <- useMart( "ensembl", dataset="hsapiens_gene_ensembl")
length(which(is.na(dat.all[,4]))) # number of rows with no UniGene ID
sum(z <- !is.na(dat.all[,4])) # number of rows with a UniGene ID
w <- getBM(attributes=c("entrezgene","unigene","hgnc_symbol","description"),
           filters="unigene",values=dat.all[z,4], mart=mart)
By doing several of these getBM() steps, I add 37 more EGs!
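The fill-in itself is essentially a lookup by the secondary identifier. A minimal sketch with toy data is given here; `dat` and `lookup` are hypothetical stand-ins for dat.all and for a getBM() result, and keeping only the first hit per UniGene ID is a simplifying assumption.

```r
# Toy table: two rows lack an Entrez Gene ID, one of them also lacks UniGene.
dat <- data.frame(IPI = c("IPI1", "IPI2", "IPI3", "IPI4"),
                  EG = c("1111", NA, NA, "4444"),
                  UniGene = c("Hs.1", "Hs.2", NA, "Hs.4"),
                  stringsAsFactors = FALSE)

# Toy stand-in for what a UniGene-filtered query might return.
lookup <- data.frame(entrezgene = c("2222", "9999"),
                     unigene = c("Hs.2", "Hs.4"),
                     stringsAsFactors = FALSE)

# Simplification: keep only the first Entrez Gene ID per UniGene ID.
lookup <- lookup[!duplicated(lookup$unigene), ]

# Fill only rows that have no EG but do have a UniGene ID to look up.
need <- is.na(dat$EG) & !is.na(dat$UniGene)
hit <- match(dat$UniGene[need], lookup$unigene)
dat$EG[need] <- lookup$entrezgene[hit]

print(dat)
```

Existing Entrez Gene IDs are never overwritten, so a conflicting lookup result (like the one for Hs.4 in the toy data) is ignored.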
My method is long and painful. That merge approach is clean and
beautiful.
Is there a way to add to the merge argument or something that would
give me the additional 100K+ IPIs and 4500+ EGs?
------------------------------
Message: 20
Date: Fri, 18 Feb 2011 13:17:18 -0800
From: "Carlson, Marc R" <mcarlson@fhcrc.org>
To: <bioconductor at stat.math.ethz.ch>
Subject: Re: [BioC] IPI to entrez id
Message-ID: <1688456294.5987.1298063838120.JavaMail.root at zimbra4.fhcrc.org>
Content-Type: text/plain; charset="utf-8"
Hi Viritha,
These things can never be 1:1, but you can pretty easily just cram
them all into a huge data.frame by doing this:
library(org.Hs.eg.db)
allAnnots <- merge(toTable(org.Hs.egPROSITE), toTable(org.Hs.egGO),
by.x="gene_id", by.y="gene_id")
head(allAnnots)
Once you have done this, you may notice that not only are these
things almost never (if ever) 1:1, but that this could have been
even worse if I had used the GO2ALL mappings (and I probably should
have, but I can't really tell because I have almost no information
about what you want to do). Anyhow, I hope this helps you, but if you
have a more specific use for this information that you are willing to
talk about then we might be able to give you a better answer.
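As a small illustration of the row multiplication described above, here is a sketch with toy tables. The column names mimic what toTable() might return but are assumptions here, and the IDs are made up. By default merge() does an inner join; all = TRUE keeps unmatched rows from both sides, but it cannot add identifiers that are absent from both tables.

```r
# Toy tables sharing a gene_id column (all IDs are made up).
a <- data.frame(gene_id = c("1", "1", "2"),
                ipi_id = c("IPI1", "IPI2", "IPI3"),
                stringsAsFactors = FALSE)
b <- data.frame(gene_id = c("1", "1", "1", "3"),
                go_id = c("GO:A", "GO:B", "GO:C", "GO:D"),
                stringsAsFactors = FALSE)

# Inner join: gene 1 gives 2 x 3 = 6 rows; genes 2 and 3 are dropped.
inner <- merge(a, b, by = "gene_id")

# Full outer join: the same 6 rows plus one NA-padded row each for genes 2 and 3.
outer <- merge(a, b, by = "gene_id", all = TRUE)

print(nrow(inner))
print(nrow(outer))
```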
Marc
------------------------------
Thanks very much,
Dick
*******************************************************************************
Richard P. Beyer, Ph.D. University of Washington
Tel.:(206) 616 7378 Env. & Occ. Health Sci. , Box 354695
Fax: (206) 685 4696 4225 Roosevelt Way NE, # 100
Seattle, WA 98105-6099
http://depts.washington.edu/ceeh/members_fc_bioinfo.html
http://staff.washington.edu/~dbeyer