from rat codelink to human locuslink
Weiwei Shi ★ 1.2k
@weiwei-shi-1407
Hi,

I am wondering whether Bioconductor can help me convert rat Codelink probe IDs to human LocusLink IDs.

A further, more general question: is there a package that can handle this kind of conversion between different ID systems and different species?

Thank you,

--
Weiwei Shi, Ph.D
Research Scientist
GeneGO, Inc.

"Did you always know?"
"No, I did not. But I believed..."
---Matrix III
@sean-davis-490
You will probably need to convert your Codelink probe IDs to some useful public ID first. I don't know what Codelink supplies as a mapping for its probe IDs. Once you have a public ID, you can probably use the biomart package for general conversion between different ID systems in different species.

Sean
Hi,

I have three examples like this:

Probe_ID          UniGene_ID   UniGene_Name
AA799301_PROBE1   Rn.107913    Lgtn protein (DBSS)
AA799313_PROBE1   Rn.32316     "sialyltransferase 10 (alpha-2,3-sialyltransferase VI)"
AA799329_PROBE1   Rn.112856    RIKEN cDNA 4632417K18 (Mm.) (DBSS)

I think the UniGene_ID might work as input for the biomartRt package (is that what you meant by biomart?). However, I looked through the package introduction and did not find how to convert between species. Should I choose the rat dataset first and then use a rat-to-human conversion? (I have a local program that does this, but I am curious how biomartRt or other R packages do it.)

Thanks,
weiwei
Hi, Weiwei.

You'll probably want to look at the help pages for biomaRt (note the correct capitalization; sorry for the confusion). To see a list of help pages, you can use the simple command:

    > help(package=biomaRt)

There are a couple of functions that look promising: getXref and getHomolog. You might want to look into those a bit.

As for your probe IDs, they look like a concatenation of a GenBank accession number and "PROBE1", so those could be useful. The UniGene ID could also be useful, but that depends on how old the annotation is, because UniGene IDs change and are deleted regularly as part of each new UniGene build.

Sean
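If the legacy probe IDs really are a GenBank accession followed by "_PROBE1", the accession can be recovered with a simple pattern substitution. The following is a minimal base R sketch, not part of any package; the suffix pattern is an assumption inferred from the three example IDs in this thread and may not hold for every probe on the array.

```r
# Legacy probe IDs copied from the example table in this thread.
probe_ids <- c("AA799301_PROBE1", "AA799313_PROBE1", "AA799329_PROBE1")

# Drop a trailing "_PROBE<digits>" suffix. The pattern is an assumption
# based on these three IDs only; IDs that do not match are left unchanged.
genbank_acc <- sub("_PROBE[0-9]+$", "", probe_ids)

# Keep both columns side by side so the original probe ID is never lost.
probe_map <- data.frame(probe_id = probe_ids,
                        genbank = genbank_acc,
                        stringsAsFactors = FALSE)
print(probe_map)
```

Because `sub()` leaves non-matching strings untouched, comparing `probe_map$probe_id` with `probe_map$genbank` afterwards is a quick way to spot IDs that do not follow the assumed format.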
Actually, those probe IDs are not the currently used Codelink probe IDs but the LEGACY_PROBE_NAME. The annotation packages in Bioconductor do not use these probe IDs, so they cannot be used to map to public identifiers. I wonder whether the CUSTUMER_PROBE_NAME is also available; it has the form GExxxxx (x being digits, as in GE12209) and is the identifier used by Codelink.

D.
Yes, that's why I was confused: I checked some Codelink IDs and they start with GE. But I used that package with the UniGene IDs the data provides; some are recognized by biomaRt while others are not.

I will try again tomorrow (it is too late for today :(

weiwei
Hi Weiwei,

Unfortunately, Ensembl currently doesn't have Codelink identifiers for the rat dataset; it only holds them for the human dataset. However, if you have rat UniGene identifiers, there are two ways to get the corresponding human EntrezGene (= LocusLink) identifiers. Here's how to do it:

    library(biomaRt)
    rat = useMart("ensembl", dataset="rnorvegicus_gene_ensembl")
    human = useMart("ensembl", dataset="hsapiens_gene_ensembl")
    ratUnigene = c("Rn.107913","Rn.32316","Rn.112856")

    # Now you can do the mapping using getBM in two steps, using the human
    # Ensembl gene identifiers as a way to go from rat to human.
    humanEnsemblId = getBM(c("unigene", "human_ensembl_gene"), filters="unigene", values=ratUnigene, rat)
    humanEntrezGene = getBM(c("ensembl_gene_id", "entrezgene"), filters="ensembl_gene_id", values=humanEnsemblId[,2], human)

    # Or you could use the getHomolog function, which does this in one step.
    # However, it will only return the EntrezGene IDs, so if you start from a
    # list of UniGene identifiers you'll get a list of human EntrezGene
    # identifiers that you cannot match up unless you query one by one.
    # I'll see if I can make getHomolog return both the identifier you start
    # from and the identifier you want to retrieve, so you can easily match
    # things up.
    getHomolog(id = ratUnigene, from.type="unigene", to.type="entrezgene", from.mart=rat, to.mart=human)

Hope this helps,
Steffen

--
Steffen Durinck, Ph.D.
Oncogenomics Section
Pediatric Oncology Branch
National Cancer Institute, National Institutes of Health
URL: http://home.ccr.cancer.gov/oncology/oncogenomics/
Phone: 301-402-8103
Address: Advanced Technology Center, 8717 Grovemont Circle, Gaithersburg, MD 20877
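The two-step approach yields two tables that still need to be joined into a single rat-to-human lookup table. The sketch below shows one way to do that join in base R. The two data frames are made-up stand-ins for the query results (placeholder IDs, and column names chosen for illustration), so the real column names returned by getBM may differ.

```r
# Stand-in for the first query: rat UniGene ID -> human Ensembl gene ID.
# All IDs here are placeholders, not real query output.
step1 <- data.frame(
  unigene = c("Rn.A", "Rn.B", "Rn.C"),
  human_ensembl_gene = c("ENSG_X", "ENSG_Y", NA),
  stringsAsFactors = FALSE
)

# Stand-in for the second query: human Ensembl gene ID -> EntrezGene ID.
# One gene appears twice, as happens when results are per transcript.
step2 <- data.frame(
  ensembl_gene_id = c("ENSG_X", "ENSG_X", "ENSG_Y"),
  entrezgene = c("111", NA, "222"),
  stringsAsFactors = FALSE
)

# Join on the human Ensembl gene ID; rat IDs without a homolog drop out.
lookup <- merge(step1, step2,
                by.x = "human_ensembl_gene", by.y = "ensembl_gene_id")

# Keep only rows with an EntrezGene ID and remove duplicate pairs.
lookup <- unique(lookup[!is.na(lookup$entrezgene), c("unigene", "entrezgene")])
print(lookup)
```

Joining the tables keeps the starting identifier next to the retrieved one, which is the matching problem described for the one-step call.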
Hi, there:

I like the getHomolog solution (the first one does not seem workable for me), but I need to modify it because of an issue like this:

    > getHomolog(id=ratUnigene[5], from.type="unigene", to.type="entrezgene",
    +            from.mart=rat, to.mart=human)
                   V1              V2    V3
    1 ENSG00000095397 ENST00000362057 25861
    2 ENSG00000095397 ENST00000265134    NA
    3 ENSG00000095397 ENST00000361938 25861
    4 ENSG00000095397 ENST00000374059    NA
    5 ENSG00000095397 ENST00000374057    NA

For one rat UniGene ID, there are five values in $V3.

    t1 <- sapply(ratUnigene, function(i) unique(getHomolog(id=i,
        from.type="unigene", to.type="entrezgene",
        from.mart=rat, to.mart=human)$V3)[1])

    > as.character(t1)
     [1] "NULL"   "10402"  "NULL"   "NULL"   "25861"  "8706"   "195827"
     [8] "NULL"   "NULL"   "NULL"   "NULL"   "NULL"   "55884"  "NULL"
    [15] "NULL"   "3898"   "23324"  "NULL"   "NULL"   "NULL"

This assumes, of course, that $V3 contains only a single distinct ID plus NA values.

Since I have about 7,400 UniGene IDs, I estimate the run should finish after 78 minutes. However, I run into a connection issue:

    > system.time(t1 <- sapply(ratUnigene, function(i)
    +     unique(getHomolog(id=i, from.type="unigene",
    +     to.type="entrezgene", from.mart=rat, to.mart=human)$V3)[1]))
    Error in postForm(paste(to.mart@host, "?", sep = ""), query = xmlQuery) :
            couldn't connect to host
    In addition: There were 50 or more warnings (use warnings() to see the first 50)
    Timing stopped at: 1.641 0.22 444.603 0 0

So I am wondering whether there is a way to download a lookup table and do the conversion locally.

Weiwei
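The per-transcript rows and the "NULL" entries can both be handled by a small helper that collapses one result into a single value. This is a base R sketch using the five rows shown above; `collapse_ids` is a hypothetical helper name, not a biomaRt function, and joining multiple distinct IDs with ";" is just one possible convention.

```r
# The five rows returned for one rat UniGene ID, copied from the post above.
hits <- data.frame(
  V1 = rep("ENSG00000095397", 5),
  V2 = c("ENST00000362057", "ENST00000265134", "ENST00000361938",
         "ENST00000374059", "ENST00000374057"),
  V3 = c(25861, NA, 25861, NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop NAs, keep distinct IDs, and return NA when
# nothing is left (instead of NULL, which sapply() cannot simplify).
collapse_ids <- function(x) {
  ids <- unique(x[!is.na(x)])
  if (length(ids) == 0) NA_character_ else paste(ids, collapse = ";")
}

collapse_ids(hits$V3)   # "25861"
collapse_ids(NULL)      # NA, for a query that returned no rows
```

Unlike taking `unique(...)[1]`, this does not depend on whether the NA or the real ID happens to come first, and it makes any case with more than one distinct ID visible rather than silently dropping it.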
Hi Weiwei,

By default, biomaRt runs in webservice mode. Running queries in a large loop in webservice mode can crash, and in that case it is better to use the package in MySQL mode. In webservice mode you could build your lookup table with just the two queries I suggested in the first solution.

However, there is an easier way to get what you want: when biomaRt is used in MySQL mode, the output of getHomolog does contain both the query IDs (rat UniGene IDs) and the results (human EntrezGene IDs), so there is no need for time-consuming big loops. Try the following:

    human = useMart("ensembl", dataset="hsapiens_gene_ensembl", mysql=TRUE)
    rat = useMart("ensembl", dataset="rnorvegicus_gene_ensembl", mysql=TRUE)
    ratUnigene = c("Rn.32316","Rn.171821")
    getHomolog(id = ratUnigene, from.type="unigene", to.type="entrezgene", from.mart=rat, to.mart=human)

It should give:

             id MappedID
    1  Rn.32316    10402
    2 Rn.171821     7058

Note that Ensembl maps everything to the transcript level, which explains why you might find redundant information in the output.

Cheers,
Steffen
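A middle ground between one query per ID and one query for everything is to send the IDs in a few batches and tolerate the occasional failed batch. This is not something suggested in the thread, only a generic base R sketch; `query_batch` is a hypothetical stand-in for whatever call actually performs the lookup, the batch size of 500 is an arbitrary assumption, and the IDs are made up.

```r
# Split a vector of IDs into consecutive batches of at most `size` elements.
batch_ids <- function(ids, size = 500) {
  split(ids, ceiling(seq_along(ids) / size))
}

# Hypothetical stand-in for the real lookup. It returns one row per ID with
# an empty mapping; replace it with the actual query function.
query_batch <- function(batch) {
  data.frame(id = batch, MappedID = NA_character_, stringsAsFactors = FALSE)
}

ids <- paste0("id_", 1:1200)          # made-up identifiers
batches <- batch_ids(ids, size = 500)

# A failed batch yields NULL instead of aborting the whole run.
results <- lapply(batches, function(b) {
  tryCatch(query_batch(b), error = function(e) NULL)
})

# rbind() skips NULL entries, so failed batches simply contribute no rows.
lookup <- do.call(rbind, results)
nrow(lookup)
```

Comparing `lookup$id` against `ids` afterwards shows which batches, if any, need to be retried.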
Hi, Steffen:

It seems that it works, because my internal method (using Oracle + Python) gives me the same number of identified UniGene IDs, but:

    > system.time(t0 <- getHomolog(id = ratUnigene, from.type="unigene",
    +     to.type="entrezgene", from.mart=rat, to.mart=human))
    stack imbalance in .Call, 119 then 120
    stack imbalance in <-, 117 then 118
    stack imbalance in {, 115 then 116
    stack imbalance in standardGeneric, 103 then 104
    stack imbalance in class, 98 then 99
    stack imbalance in <-, 96 then 97
    stack imbalance in {, 94 then 95
    stack imbalance in <-, 88 then 89
    stack imbalance in {, 86 then 87
    [1] 11.284 15.534 32.705  0.000  0.000
    > dim(t0)
    [1] 4285    2

So, what are those stack imbalance messages?

This time it really is fast. Thanks.
Another question on this:

    > getHomolog(id = "Rn.105679", from.type="unigene",
    +     to.type="entrezgene", from.mart=rat, to.mart=human)
    stack imbalance in .Call, 102 then 103
    stack imbalance in <-, 100 then 101
    stack imbalance in {, 98 then 99
    stack imbalance in standardGeneric, 86 then 87
    stack imbalance in class, 81 then 82
    stack imbalance in <-, 79 then 80
    stack imbalance in {, 77 then 78
    stack imbalance in <-, 71 then 72
    stack imbalance in {, 69 then 70
    NULL

But I checked our database output: for Rn.105679, the corresponding human LocusLink ID is 54838.

I am really confused now about which is correct.
Mapping from one ID to another, then to another species, and then to another ID in the new species is bound to give slightly different results depending on the institution doing the mapping. It appears that Ensembl does not have a gene that UniGene Rn.105679 maps to. It sounds like you are using NCBI mapping resources in "your database"; if you have a relational database of NCBI information, feel free to use it. The important thing is to be able to describe whatever process you use in a reproducible way.

Sean
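One way to make such differences explicit and reproducible is to put both mappings side by side and label each ID. The sketch below does this in base R. Only two facts come from this thread: Ensembl maps Rn.32316 to 10402, and the in-house database maps Rn.105679 to 54838. The in-house entry for Rn.32316 is assumed to agree purely for illustration, and the column names are hypothetical.

```r
# Ensembl-derived mapping (Rn.32316 -> 10402 is quoted in this thread;
# Rn.105679 returned no result, so it has no row).
ensembl <- data.frame(unigene = "Rn.32316",
                      entrez_ensembl = "10402",
                      stringsAsFactors = FALSE)

# In-house mapping. Rn.105679 -> 54838 is quoted in this thread; the
# Rn.32316 row is an assumption made only to show an agreeing case.
inhouse <- data.frame(unigene = c("Rn.32316", "Rn.105679"),
                      entrez_inhouse = c("10402", "54838"),
                      stringsAsFactors = FALSE)

# Full outer join so IDs missing from either source are kept.
both <- merge(ensembl, inhouse, by = "unigene", all = TRUE)

# Label each row by how the two sources compare.
both$status <- ifelse(is.na(both$entrez_ensembl), "in-house only",
               ifelse(is.na(both$entrez_inhouse), "Ensembl only",
               ifelse(both$entrez_ensembl == both$entrez_inhouse,
                      "agree", "disagree")))
print(both)
```

Saving a table like this together with the database versions used gives a record of which source supplied each mapping.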
Diego Diez ▴ 760
@diego-diez-4520
Hi,

As Sean suggested, you first need to convert the Codelink IDs to some useful public ID. For this step you can use the Codelink annotation packages, which are available in the download section for several Codelink platforms (GEChip BiocViews). They use the information from Codelink to map to GenBank, UniGene and Entrez Gene IDs (and to other databases thanks to AnnBuilder).

Diego.
Just to continue that thought, there is a rat homology package available here:

http://bioconductor.org/packages/1.9/data/annotation/html/rnohomology.html

Sean
@steffen-durinck-1780
Hi Weiwei,

I have never encountered these stack imbalance messages, and I am wondering whether they are produced by the system.time function that you wrapped around getHomolog. There is no .Call function directly in the biomaRt code, so it looks like this message is coming from RMySQL, RCurl or system.time(). Do you also get these messages when running getHomolog directly, without system.time?

Best,
Steffen