IPI to entrez id

Dick Beyer ★ 1.4k

Hi to all,

For several years now, I have been doing GO analysis on lists of proteins derived from MS. I am given IPIs by the proteomics folks and need the corresponding Entrez Gene IDs. Putting aside the issues of non-unique mapping from IPI to EG, isoforms, etc., I was wondering if anyone would comment on my method of getting the Entrez Gene IDs. I'd really like to use Marc Carlson's merge method (shown below), but that approach seems to miss several thousand IPI/EG matches that my method finds.

I start with ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.HUMAN.dat.gz, and extract a subset of the rows:

    ipiHUMAN <- readLines(con = "ipi.HUMAN.3.80.dat", n = -1)  # build 11feb2011
    dbfetch.all <- ipiHUMAN
    rm(ipiHUMAN)

    # Explanation of the data format is found here
    # http://www.ebi.ac.uk/2can/tutorials/formats.html#swiss

    length(dbfetch.all)                                       # 3180244
    length(eg  <- grep("^DR Entrez Gene", dbfetch.all))       # 80296
    length(ids <- grep("^ID", dbfetch.all))                   # 86719
    length(de  <- grep("^DE", dbfetch.all))                   # 92454
    length(ac  <- grep("^AC", dbfetch.all))                   # 93720
    length(ug  <- grep("^DR UniGene", dbfetch.all))           # 88314
    length(up  <- grep("^DR UniProtKB", dbfetch.all))         # 110593
    length(en  <- grep("^DR ENSEMBL", dbfetch.all))           # 77340
    length(rs  <- grep("^DR REFSEQ_REVIEWED", dbfetch.all))   # 14559

and eventually turn this into a data.frame with the columns:

    "IPI","EG","GeneSymbol","UniGene","UniProtKB","ENSEMBL","REFSEQ_REVIEWED"

(Note: Not every IPI entry has every field)

For this build of the IPI file, my data.frame ends up as

    dim(dat.all)
    [1] 183153 7

Of these 183153 rows, there are 171909 unique IPIs and 22342 unique Entrez Gene IDs.

The merge method shown below from Marc Carlson gives 69315 unique IPIs and 17783 unique Entrez Gene IDs (you get the same numbers whether you use org.Hs.egGO2ALLEGS or org.Hs.egGO).

When I build my 7-column data.frame, I initially get 22305 unique Entrez Gene IDs, and I then go through some additional steps of trying to fill in the missing EGs. I do this by taking the IPIs with no EGs, and using biomaRt with UniGene, UniProtKB etc. as inputs to getBM(), and hope I get a few more EGs.

For example:

    library(biomaRt)
    mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
    length(which(is.na(dat.all[, 4])))
    sum(z <- !is.na(dat.all[, 4]))
    w <- getBM(attributes = c("entrezgene", "unigene", "hgnc_symbol", "description"),
               filters = "unigene", values = dat.all[z, 4], mart = mart)

By doing several of these getBM() steps, I add 37 more EGs!

My method is long and painful. That merge approach is clean and beautiful.

Is there a way to add to the merge argument or something that would give me the additional 100K+ IPIs and 4500+ EGs?

------------------------------
Message: 20
Date: Fri, 18 Feb 2011 13:17:18 -0800
From: "Carlson, Marc R" <mcarlson@fhcrc.org>
To: <bioconductor at stat.math.ethz.ch>
Subject: Re: [BioC] IPI to entrez id

Hi Viritha,

These things can never be 1:1, but you can pretty easily just cram them all into a huge data.frame by doing this:

    library(org.Hs.eg.db)
    allAnnots <- merge(toTable(org.Hs.egPROSITE), toTable(org.Hs.egGO),
                       by.x="gene_id", by.y="gene_id")
    head(allAnnots)

Once you have done this, you may notice not only that these things are almost never (if ever) 1:1, but that this could have been even worse if I had used the GO2ALL mappings (and I probably should have, but I can't really tell because I have almost no information about what you want to do). Anyhow, I hope this helps you, but if you have a more specific use for this information that you are willing to talk about then we might be able to give you a better answer.

Marc
------------------------------

Thanks very much,
Dick

*******************************************************************************
Richard P. Beyer, Ph.D.
University of Washington, Env. & Occ. Health Sci., Box 354695
4225 Roosevelt Way NE, # 100, Seattle, WA 98105-6099
Tel.: (206) 616 7378  Fax: (206) 685 4696
http://depts.washington.edu/ceeh/members_fc_bioinfo.html
http://staff.washington.edu/~dbeyer
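For readers following along, the record-parsing step described in the post can be sketched in base R. This is only an illustration, not Dick's actual code: the helper name parse_ipi_records is hypothetical, the sample lines assume a Swiss-Prot-like layout (two-letter line code, records ended by "//"), and every accession and gene ID is invented.

```r
# Toy flat-file lines in an assumed Swiss-Prot-like layout (all IDs invented)
toy <- c(
  "ID   IPI00000001.2   IPI;   PRT;   100 AA.",
  "AC   IPI00000001; IPI00000099;",
  "DR   Entrez Gene; 1111; GENEA; -.",
  "//",
  "ID   IPI00000002.1   IPI;   PRT;   200 AA.",
  "AC   IPI00000002;",
  "//"
)

# Hypothetical helper: one output row per (primary IPI, Entrez Gene) pair
parse_ipi_records <- function(lines) {
  # A new record starts on the line after each "//" terminator
  is_end <- lines == "//"
  rec <- cumsum(c(0, head(is_end, -1))) + 1
  recs <- split(lines[!is_end], rec[!is_end])
  out <- lapply(recs, function(r) {
    # Primary identifier from the ID line, version suffix dropped
    id <- sub("^ID\\s+(IPI[0-9]+).*$", "\\1", grep("^ID", r, value = TRUE)[1])
    # Zero, one or several Entrez Gene cross-references
    eg_lines <- grep("^DR\\s+Entrez Gene", r, value = TRUE)
    eg <- sub("^DR\\s+Entrez Gene;\\s*([0-9]+);.*$", "\\1", eg_lines)
    if (length(eg) == 0) eg <- NA_character_  # record without an EG line
    data.frame(IPI = id, EG = eg, stringsAsFactors = FALSE)
  })
  do.call(rbind, unname(out))
}

ipi2eg <- parse_ipi_records(toy)
```

Records without an Entrez Gene line are kept with NA rather than dropped, which makes it easy to see which IPIs would need a second lookup (for example through biomaRt).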

Proteomics GO biomaRt

ADD COMMENT • link 14.7 years ago Dick Beyer ★ 1.4k

viritha kaza ▴ 580

Hi Marc,

Thanks for the explanation. I now realise that there cannot be a 1:1 relation between the IPI and the Entrez IDs. As Dick was saying, I used both methods: first the cross-reference file from ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.HUMAN.xrefs.gz, and then, for the remaining IPIs, the biomaRt code I had mentioned previously:

    source('http://bioconductor.org/biocLite.R')
    biocLite("biomaRt")
    library("biomaRt")
    ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
    # ipi_test.txt contains the IPIs that I wanted to convert
    ipi <- scan("ipi_test.txt", what = character(), sep = '\n', quote = "")
    ipi_entrez <- getBM(attributes = c("ipi", "entrezgene", "hgnc_symbol"),
                        filters = "ipi", values = ipi, mart = ensembl)
    write.table(ipi_entrez, "ipi_entrez_test.txt", sep = '\t')

I could not find the merge method of yours that Dick was mentioning. For now I am only interested in the Entrez ID and gene symbol for the respective IPIs in my list. Is there an easy method for this? If someone could share it, that would be excellent.

Thanks,
Viritha

ADD COMMENT • link 14.7 years ago viritha kaza ▴ 580

Hi Viritha,

Marc's one-liner seems very simple to do. What errors are you getting when you try this? You could try this:

    source('http://bioconductor.org/biocLite.R')
    biocLite("org.Hs.eg.db")
    library(org.Hs.eg.db)
    allAnnots <- merge(toTable(org.Hs.egPROSITE), toTable(org.Hs.egGO),
                       by.x="gene_id", by.y="gene_id")

Works for me with R 2.12.2. I haven't used ipi.HUMAN.xrefs, I just use this one: ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.HUMAN.dat.gz

Hope that helps,
Dick

ADD REPLY • link 14.7 years ago Dick Beyer ★ 1.4k

Hi Viritha,

The merge function is in base R.

What is the output from your sessionInfo()?

Marc

ADD REPLY • link 14.7 years ago Marc Carlson ★ 7.2k

Hi Marc,

The code is working. I will have to add code to extract the IPIs in my list.

Thanks,
Viritha
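One possible way to pull out only the IPIs in a list is sketched below in base R. The table xref, its column names, and all IDs are invented for illustration; in practice the table would come from the xrefs file, from biomaRt, or from org.Hs.eg.db.

```r
# Hypothetical IPI -> Entrez Gene / symbol table (all values invented)
xref <- data.frame(
  ipi    = c("IPI00000001", "IPI00000002", "IPI00000002", "IPI00000003"),
  entrez = c("1111", "2222", "3333", "4444"),
  symbol = c("GENEA", "GENEB", "GENEC", "GENED"),
  stringsAsFactors = FALSE
)

# IPIs of interest, e.g. as read with scan() from a text file
my_ipi <- c("IPI00000002", "IPI00000003", "IPI00000009")

# Left join: keep every requested IPI, with NA where nothing matches
hits <- merge(data.frame(ipi = my_ipi, stringsAsFactors = FALSE), xref,
              by = "ipi", all.x = TRUE)

# IPIs that still lack an Entrez Gene ID and need another lookup
unmapped <- hits$ipi[is.na(hits$entrez)]
```

Using all.x = TRUE keeps unmatched IPIs visible instead of silently dropping them, and an IPI that maps to several genes simply appears on several rows.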

ADD REPLY • link 14.7 years ago viritha kaza ▴ 580

Marc Carlson ★ 7.2k

Hi Dick,

Is there any reason why something like this won't work for you to attach the GO IDs?

    merge(dat.all, toTable(org.Hs.egGO), by.x="EG", by.y="gene_id")

When I build the org.Hs.eg.db package, I download the IPI IDs in the MySQL database from the following source:

ftp://ftp.ebi.ac.uk/pub/databases/IPI/current

The most recent time I did this (last week) resulted in an IPI-to-gene table that contains only 86719 unique IPIs. That number drops a bit, to 77387 distinct IPI IDs, when I fold the IPI IDs into the org database (which takes as its set of unique Entrez Gene IDs the ones that are currently listed at NCBI). And if you merge with a GO table, I expect that it will drop a bit more.

In looking at the file that you are parsing, I get exactly the same number of unique IPI IDs if I extract from the ID fields and match to the Entrez Gene fields (which also gives me the exact same number of Entrez Gene IDs). I can only get the huge additional number of IPI IDs from this file if I also mine the AC field and assume that these IPI IDs should also map to the exact same things. But the direct database dump from EBI does not give me these mappings. In fact, it does not seem to even contain them. This makes me concerned that these IDs may not be what you think they are.

Anyhow, I hope this helps,

Marc
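To make the ID-versus-AC point concrete, here is a small base R sketch of how mining the AC field inflates the number of unique IPIs. The data frame and all identifiers are invented, and the assumption that every accession on the AC line inherits the record's Entrez Gene ID is exactly the assumption Marc questions above.

```r
# One row per record: primary ID, raw AC string, Entrez Gene ID (all invented)
toy <- data.frame(
  primary = c("IPI00000001", "IPI00000002"),
  ac      = c("IPI00000001; IPI00000099; IPI00000098;", "IPI00000002;"),
  EG      = c("1111", "2222"),
  stringsAsFactors = FALSE
)

# Split each AC string into its individual accessions
ac_list <- strsplit(gsub("\\s+", "", toy$ac), ";")

# One row per accession, each inheriting the record's Entrez Gene ID
expanded <- data.frame(
  IPI     = unlist(ac_list),
  primary = rep(toy$primary, lengths(ac_list)),
  EG      = rep(toy$EG, lengths(ac_list)),
  stringsAsFactors = FALSE
)

n_primary  <- length(unique(toy$primary))   # IDs from the ID field only
n_expanded <- length(unique(expanded$IPI))  # primary plus secondary accessions
```

The number of Entrez Gene IDs is unchanged by the expansion; only the number of IPI keys pointing at them grows.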

ADD COMMENT • link 14.7 years ago Marc Carlson ★ 7.2k

Dick Beyer ★ 1.4k

Hi Marc,

Thanks very much for the merge example. That's so much cleaner than my usual approach.

As far as the differing numbers of IPIs, I agree that using the AC field gives a lot more IPIs than just using the ID field. I guess I'm trying to solve a different problem than other folks. I get these lists of IPIs from proteomics folks, and some of the IPIs wouldn't show up in the ID field of "ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.HUMAN.dat.gz", but are present in the AC field because they are no longer the current primary identifier (http://www.ebi.ac.uk/IPI/Algorithm.html#SECONDARIES). What I am really after is the Entrez Gene ID, so for me, the IPI is just a pointer to the EG.

However, with this approach:

    library(org.Hs.eg.db)
    allAnnots <- merge(toTable(org.Hs.egPROSITE), toTable(org.Hs.egGO),
                       by.x="gene_id", by.y="gene_id")

I get several thousand fewer EGs than using either ipi.HUMAN.dat or ipi.HUMAN.xrefs. I looked at a few of the EGs that I get out of these files that are not in org.Hs.eg.db, and they seem valid. I'll do some more checking to be sure. In ipi.HUMAN.xrefs I get 22113 unique EGs and 86719 unique IPIs from a download 2 days ago from "ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.HUMAN.xrefs.gz".

What I had hoped I was creating was a table with any and all IPIs (primary and secondary), each with a valid EG.

Thanks very much for your great help,
Dick

ADD COMMENT • link 14.7 years ago Dick Beyer ★ 1.4k

Hi Dick, So lets look at that example that you gave in your last post where you merged the GO table with the PROSITE one. It is important to understand that when you call merge(), it does it's magic by basically performing an inner join on the two tables that comprise it's 1st two first arguments. Therefore the mechanics of how that merge will do its job mean that you have effectively restricted the results to only those entrez genes where you have BOTH a GO annotation AND a PROSITE annotation. So the result you see (more EGs from the bigger join) would happen if for example you had increased the size of the IPI tables (by pairing up some deprecated IPI ids with some legitimate entrez gene IDs). In this situation, these entrez gene IDs would be perfectly legitimate, but their IPI IDs would all be older deprecated IPI ids. I am not sure if that is what you really want or not, but if it is, then the final table indeed would be bigger. Hope that clarifies things, Marc On 03/10/2011 07:16 AM, Dick Beyer wrote: > Hi Marc, > > Thanks very much for the merge example. That's so much cleaner than my usual approach. > > As far as the differing numbers of IPIs, I agree that using the AC field gives a lot more IPIs than just using the ID field. I guess I'm trying to solve a different problem than other folks. I get these lists of IPIs from proteomics folks, and some of the IPIs wouldn't show up in the ID field of "ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.HUMAN.dat.gz", but are present in the AC field because they are no longer the current primary identifier (http://www.ebi.ac.uk/IPI/Algorithm.html#SECONDARIES). What I am really after is the Entrez Gene ID, so for me, the IPI is just a pointer to the EG. > > However, with this approach: > > library(org.Hs.eg.db) > allAnnots <- merge(toTable(org.Hs.egPROSITE), toTable(org.Hs.egGO), by.x="gene_id", by.y="gene_id") > > I get several thousand fewer EGs than using either ipi.HUMAN.dat or ipi.HUMAN.xrefs. 
I looked at a few of the EGs that I get out of these that are not in org.Hs.eg.db, and they seem valid. I'll do some more checking to be sure. In ipi.HUMAN.xrefs I get 22113 unique EGs and 86719 unique IPIs from a download 2 days ago from "ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.HUMAN.xrefs.gz" > > What I had hoped I was creating was a table with any and all IPIs (primary and secondary) but each with valid EGs. > > Thanks very much for your great help, > Dick > ------------------------------ > > Message: 24 > Date: Wed, 09 Mar 2011 17:25:34 -0800 > From: Marc Carlson <mcarlson at="" fhcrc.org=""> > To: bioconductor at r-project.org > Subject: Re: [BioC] IPI to entrez id > Message-ID: <4D78288E.6060904 at fhcrc.org> > Content-Type: text/plain; charset=ISO-8859-1 > > Hi Dick, > > Is there any reason why something like this won't work for you to attach > the GO Ids? > > merge(dat.all, toTable(org.Hs.egGO), by.x="EG", by.y="gene_id") > > > > When I build the org.Hs.eg.db package, I download the IPI Ids in the > mySQL database from the following source: > > ftp://ftp.ebi.ac.uk/pub/databases/IPI/current > > > The most recent time I did this (last week) results in an ipi to gene > table that contains only 86719 unique IPIs. And that number drops a bit > when I fold the IPI IDs into the org database (which takes as its set of > unique entrez gene IDs the ones that are currently listed at NCBI) to > 77387 distinct IPI IDs. And if you merge with a GO table, I expect that > it will drop a bit more. > > In looking at the file that you are parsing, I get exactly the same > number of unique IPI ids if I extract from the ID fields and match to > the Entrez Gene fields (which also gives me the exact same number of > entrez gene IDs). I can only get the huge additional number of IPI Ids > from this file if I also mine the AC field and assume that these IPI Ids > also should map to the exact same things. 
But the direct database dump > from EBI does not give me these mappings. In fact, it does not seem to > even contain them. This causes me to be concerned that maybe these IDs > may not what you think they are? > > > Anyhow, I hope this helps, > > > Marc > > > > > On 03/08/2011 11:39 AM, Dick Beyer wrote: > Hi to all, > > For several years now, I have been doing GO analysis on lists of > proteins derived from MS. I am given IPIs by the proteomics folks and > need the corresponding Entrez Gene IDs. Putting aside the issues of > non-unique mapping from IPI to EG, isoforms, etc., I was wondering if > anyone would comment on my method of getting the Entrez Gene IDs. I'd > really like to use Marc Carlson's merge method (shown below), but that > approach seems to miss several thousand IPI/EG matches that my method > finds. > > I start with > ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.HUMAN.dat.gz, and > extract a subset of the rows: > > ipiHUMAN <- readLines(con = "ipi.HUMAN.3.80.dat",n=-1) # build 11feb2011 > dbfetch.all <- ipiHUMAN > rm(ipiHUMAN) > > # Explanation of the data format is found here > # http://www.ebi.ac.uk/2can/tutorials/formats.html#swiss > > length(dbfetch.all) # 3180244 > length(eg <- grep("^DR Entrez Gene", dbfetch.all)) # 80296 > length(ids <- grep("^ID", dbfetch.all)) # 86719 > length(de <- grep("^DE", dbfetch.all)) # 92454 > length(ac <- grep("^AC", dbfetch.all)) # 93720 > length(ug <- grep("^DR UniGene", dbfetch.all)) # 88314 > length(up <- grep("^DR UniProtKB", dbfetch.all)) # 110593 > length(en <- grep("^DR ENSEMBL", dbfetch.all)) # 77340 > length(rs <- grep("^DR REFSEQ_REVIEWED",dbfetch.all)) # 14559 > > and eventually turn this into a data.frame with the columns: > > "IPI","EG","GeneSymbol","UniGene","UniProtKB","ENSEMBL","REFSEQ_REVI EWED" > > (Note: Not every IPI entry has every field) > > For this build of the IPI file, my data.frame ends up as > dim(dat.all) > [1] 183153 7 > > Of these 183153 IPIs, there are 171909 unique IPIs, and 
22342 unique
> Entrez Gene IDs.
>
> The merge method shown below from Marc Carlson gives 69315 unique IPIs
> and 17783 unique Entrez Gene IDs (you get the same numbers whether you
> use org.Hs.egGO2ALLEGS or org.Hs.egGO).
>
> When I build my 7 column data.frame, I initially get 22305 unique
> Entrez Gene IDs, and I then go through some additional steps of trying
> to fill in the missing EGs. I do this by taking the IPIs with no EGs,
> and using biomaRt with UniGene, UniProtKB etc as inputs to getBM(),
> and hope I get a few more EGs.
>
> For example:
>
> library(biomaRt)
> mart <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
> length(which(is.na(dat.all[,4])))
> sum(z <- !is.na(dat.all[,4]))
> w <- getBM(attributes=c("entrezgene","unigene","hgnc_symbol","description"),
>            filters="unigene", values=dat.all[z,4], mart=mart)
>
> By doing several of these getBM() steps, I add 37 more EGs!
>
> My method is long and painful. That merge approach is clean and
> beautiful.
>
> Is there a way to add to the merge argument or something that would
> give me the additional 100K+ IPIs and 4500+ EGs?
>
> ------------------------------
> Message: 20
> Date: Fri, 18 Feb 2011 13:17:18 -0800
> From: "Carlson, Marc R" <mcarlson at fhcrc.org>
> To: <bioconductor at stat.math.ethz.ch>
> Subject: Re: [BioC] IPI to entrez id
> Message-ID:
> <1688456294.5987.1298063838120.JavaMail.root at zimbra4.fhcrc.org>
> Content-Type: text/plain; charset="utf-8"
>
> Hi Viritha,
>
> These things can never be 1:1, but you can pretty easily just cram
> them all into a huge data.frame by doing this:
>
> library(org.Hs.eg.db)
> allAnnots <- merge(toTable(org.Hs.egPROSITE), toTable(org.Hs.egGO),
>                    by.x="gene_id", by.y="gene_id")
> head(allAnnots)
>
> Once you have done this, you may notice not only that these things
> are almost never (if ever) 1:1, but that this could have been
> even worse if I had used the GO2ALL mappings (and I probably should
> have, but I can't really tell because I have almost no information
> about what you want to do). Anyhow, I hope this helps you, but if you
> have a more specific use for this information that you are willing to
> talk about then we might be able to give you a better answer.
>
> Marc
> ------------------------------
>
> Thanks very much,
> Dick
> *******************************************************************************
> Richard P. Beyer, Ph.D.   University of Washington
> Tel.:(206) 616 7378       Env. & Occ. Health Sci. , Box 354695
> Fax: (206) 685 4696       4225 Roosevelt Way NE, # 100
>                           Seattle, WA 98105-6099
> http://depts.washington.edu/ceeh/members_fc_bioinfo.html
> http://staff.washington.edu/~dbeyer
> *******************************************************************************
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
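
On the quoted question of whether a merge() argument could keep the extra rows: base R's merge() defaults to all = FALSE, which keeps only keys present in both tables, while all.x = TRUE keeps every row of the first table. The following is a minimal sketch on invented toy tables. The column names IPI and EG follow dat.all as described in the thread, but every value is hypothetical, and the second table only stands in for toTable(org.Hs.egGO). Note that all.x = TRUE only retains rows that lack a match; it cannot add annotations the second table does not have.

```r
# Toy stand-ins for dat.all and toTable(org.Hs.egGO); all IDs are invented.
ipi2eg <- data.frame(
  IPI = c("IPI00000001", "IPI00000002", "IPI00000003"),
  EG  = c("101", "102", "103"),
  stringsAsFactors = FALSE
)
eg2go <- data.frame(
  gene_id = c("101", "101", "103"),
  go_id   = c("GO:0000001", "GO:0000002", "GO:0000003"),
  stringsAsFactors = FALSE
)

# Default merge() behaves like an inner join: EG 102 has no GO row,
# so it is dropped from the result.
inner <- merge(ipi2eg, eg2go, by.x = "EG", by.y = "gene_id")

# all.x = TRUE keeps every IPI/EG row; go_id is NA where no GO row exists.
left <- merge(ipi2eg, eg2go, by.x = "EG", by.y = "gene_id", all.x = TRUE)

length(unique(inner$EG))  # EGs surviving the inner join
length(unique(left$EG))   # EGs kept with all.x = TRUE
```

Comparing the two counts on real tables would show how many Entrez Gene IDs are lost purely because they have no row in the annotation table being merged.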

Marc Carlson ★ 7.2k, 14.7 years ago

Hi Marc,

Thanks for the good explanation. I do want the deprecated IPIs (or should I get the proteomics folks I work with to update things at their end more frequently?). Maybe it would be cleaner to just massage the IPIs I get so as to change deprecated IPIs to current ones, then use the merged GO and PROSITE table. I'll have to play around with that and see.

I fuss around a lot trying to get every possible Entrez Gene ID for the IPIs I deal with. So when I get Entrez Gene IDs for a subset of a list of IPIs, I go through several extra steps (biomaRt etc) to try and find Entrez Gene IDs for the IPIs that are missing them.

Thanks very much,
Dick
*******************************************************************************
Richard P. Beyer, Ph.D.   University of Washington
Tel.:(206) 616 7378       Env. & Occ. Health Sci. , Box 354695
Fax: (206) 685 4696       4225 Roosevelt Way NE, # 100
                          Seattle, WA 98105-6099
http://depts.washington.edu/ceeh/members_fc_bioinfo.html
http://staff.washington.edu/~dbeyer
*******************************************************************************

On Thu, 10 Mar 2011, Marc Carlson wrote:

> Hi Dick,
>
> So let's look at that example that you gave in your last post where you
> merged the GO table with the PROSITE one. It is important to understand
> that when you call merge(), it does its magic by basically performing
> an inner join on the two tables that comprise its first two
> arguments. Therefore the mechanics of how that merge will do its job
> mean that you have effectively restricted the results to only those
> entrez genes where you have BOTH a GO annotation AND a PROSITE annotation.
>
> So the result you see (more EGs from the bigger join) would happen if
> for example you had increased the size of the IPI tables (by pairing up
> some deprecated IPI ids with some legitimate entrez gene IDs). In this
> situation, these entrez gene IDs would be perfectly legitimate, but
> their IPI IDs would all be older deprecated IPI ids. I am not sure if
> that is what you really want or not, but if it is, then the final table
> indeed would be bigger.
>
> Hope that clarifies things,
>
> Marc
>
> On 03/10/2011 07:16 AM, Dick Beyer wrote:
>> Hi Marc,
>>
>> Thanks very much for the merge example. That's so much cleaner than my usual approach.
>>
>> As far as the differing numbers of IPIs, I agree that using the AC field gives a lot more IPIs than just using the ID field. I guess I'm trying to solve a different problem than other folks. I get these lists of IPIs from proteomics folks, and some of the IPIs wouldn't show up in the ID field of "ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.HUMAN.dat.gz", but are present in the AC field because they are no longer the current primary identifier (http://www.ebi.ac.uk/IPI/Algorithm.html#SECONDARIES). What I am really after is the Entrez Gene ID, so for me, the IPI is just a pointer to the EG.
>>
>> However, with this approach:
>>
>> library(org.Hs.eg.db)
>> allAnnots <- merge(toTable(org.Hs.egPROSITE), toTable(org.Hs.egGO), by.x="gene_id", by.y="gene_id")
>>
>> I get several thousand fewer EGs than using either ipi.HUMAN.dat or ipi.HUMAN.xrefs. I looked at a few of the EGs that I get out of these that are not in org.Hs.eg.db, and they seem valid. I'll do some more checking to be sure. In ipi.HUMAN.xrefs I get 22113 unique EGs and 86719 unique IPIs from a download 2 days ago from "ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.HUMAN.xrefs.gz"
>>
>> What I had hoped I was creating was a table with any and all IPIs (primary and secondary) but each with valid EGs.
>>
>> Thanks very much for your great help,
>> Dick
>> ------------------------------
>>
>> Message: 24
>> Date: Wed, 09 Mar 2011 17:25:34 -0800
>> From: Marc Carlson <mcarlson at fhcrc.org>
>> To: bioconductor at r-project.org
>> Subject: Re: [BioC] IPI to entrez id
>> Message-ID: <4D78288E.6060904 at fhcrc.org>
>> Content-Type: text/plain; charset=ISO-8859-1
>>
>> Hi Dick,
>>
>> Is there any reason why something like this won't work for you to attach
>> the GO Ids?
>>
>> merge(dat.all, toTable(org.Hs.egGO), by.x="EG", by.y="gene_id")
>>
>> When I build the org.Hs.eg.db package, I download the IPI Ids in the
>> MySQL database from the following source:
>>
>> ftp://ftp.ebi.ac.uk/pub/databases/IPI/current
>>
>> The most recent time I did this (last week) results in an ipi to gene
>> table that contains only 86719 unique IPIs. And that number drops a bit
>> when I fold the IPI IDs into the org database (which takes as its set of
>> unique entrez gene IDs the ones that are currently listed at NCBI) to
>> 77387 distinct IPI IDs. And if you merge with a GO table, I expect that
>> it will drop a bit more.
>>
>> In looking at the file that you are parsing, I get exactly the same
>> number of unique IPI ids if I extract from the ID fields and match to
>> the Entrez Gene fields (which also gives me the exact same number of
>> entrez gene IDs). I can only get the huge additional number of IPI Ids
>> from this file if I also mine the AC field and assume that these IPI Ids
>> also should map to the exact same things. But the direct database dump
>> from EBI does not give me these mappings. In fact, it does not seem to
>> even contain them. This causes me to be concerned that maybe these IDs
>> may not be what you think they are?
>>
>> Anyhow, I hope this helps,
>>
>> Marc
>>
>> On 03/08/2011 11:39 AM, Dick Beyer wrote:
>> Hi to all,
>>
>> For several years now, I have been doing GO analysis on lists of
>> proteins derived from MS.
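
The idea raised in this reply, converting deprecated IPIs to current ones before merging, could be sketched roughly as below. This is only an illustration under stated assumptions: that the .dat file uses Swiss-Prot-style records in which the ID line carries the primary accession, the AC line lists the primary followed by any secondary accessions, and each record ends with "//"; and that a secondary accession should inherit the Entrez Gene ID of its primary, which is exactly the assumption Marc questions. The toy records, accessions, gene IDs, and the helper parse_record() are all invented for the example.

```r
# Toy records in an assumed Swiss-Prot-like layout; every ID is invented.
dat <- c(
  "ID   IPI00000001.1         IPI;      PRT;   100 AA.",
  "AC   IPI00000001; IPI00000090; IPI00000091;",
  "DR   Entrez Gene; 101; GENEA; -.",
  "//",
  "ID   IPI00000002.3         IPI;      PRT;   200 AA.",
  "AC   IPI00000002;",
  "DR   Entrez Gene; 102; GENEB; -.",
  "//"
)

# Assign each line to a record; a record ends at its "//" terminator.
is_end <- dat == "//"
rec <- cumsum(c(0, head(is_end, -1))) + 1

# Hypothetical helper: one row per (accession, Entrez Gene ID) pair.
parse_record <- function(lines) {
  id_line <- grep("^ID", lines, value = TRUE)[1]
  primary <- sub("^ID +(IPI[0-9]+).*$", "\\1", id_line)
  ac_lines <- grep("^AC", lines, value = TRUE)
  acc <- unlist(strsplit(sub("^AC +", "", ac_lines), "; *"))
  acc <- acc[nzchar(acc)]
  eg_lines <- grep("^DR +Entrez Gene;", lines, value = TRUE)
  eg <- sub("^DR +Entrez Gene; *([0-9]+);.*$", "\\1", eg_lines)
  if (length(eg) == 0) eg <- NA_character_
  tab <- expand.grid(IPI = acc, EG = eg, stringsAsFactors = FALSE)
  tab$primary <- primary
  tab
}

tab <- do.call(rbind, lapply(split(dat, rec), parse_record))

# Translate a submitted list (possibly deprecated IPIs) to current primaries.
submitted <- c("IPI00000090", "IPI00000002")
current <- tab$primary[match(submitted, tab$IPI)]
```

Rows where tab$IPI equals tab$primary correspond to what the ID field alone would give; the remaining rows exist only because the AC field was mined, which is where the larger IPI counts in this thread come from.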

Dick Beyer ★ 1.4k, 14.7 years ago
