Re: is there an identifier that uniquely identifies a gene all over the many databases?
@mauedealiceit-3511
I forgot to specify that I am only dealing with the human species. I used the ENSGxxxxx identifier to retrieve some data that I hoped would uniquely identify the gene.

> gene.map <- getBM(attributes=c("hgnc_symbol","external_gene_id","refseq_dna"), filters="ensembl_gene_id", values="ENSG00000206557", mart=hmart)
> show(gene.map)

As long as all human genes are uniquely identified by their respective "hgnc_symbol", I am fine. Why should I use the other identifier you mention, ENSTxxxx?

My goal is to get the 3' UTR sequence associated with experimentally validated genes. Given the species "Human" and a miRNA identifier "hsa-miR-yyy", the TarBase interface returns a list of all genes (ENSGxxxxxx) that have been experimentally tested. I pass each such ENSGxxxxxx identifier to getSequence (a biomaRt function) to get the 3' UTR sequence. I was surprised to find multiple 3' UTR sequences associated with the same ENSGxxxxxx. Maybe each transcript is identified by a unique ENSTxxxx identifier... TRUE/FALSE?

Thank you.
Regards,
Maura

-----Original Message-----
From: Simon Anders [mailto:anders@ebi.ac.uk]
Sent: Sun 12/07/2009 23:14
To: mauede@alice.it
Cc: Bioconductor List
Subject: Re: [BioC] is there an identifier that uniquely identifies a gene all over the many databases ?

Hi Maura

mauede@alice.it wrote:
> By trial-and-error it seems the attribute "hgnc_symbol" yields a unique gene identifier ... but I am not quite sure.
> Instead a variable number of "refseq_dna" values are listed for the same "hgnc_symbol".

HGNC is the Human Genome Organisation's Gene Nomenclature Committee. Their gene symbols are in fact unique (that is the whole point of HGNC), but not every gene has an HGNC symbol yet. See http://www.genenames.org/ for more information.

> In short, given the "ensembl_gene_id" (ENSGxxxxxxxxxxx), is it possible to get the gene identifier for which this is a transcript ?

First of all, ENSGxxxxx IDs are for human genes. Human transcripts get ENSTxxxx identifiers (with a "T" instead of a "G"). Each Ensembl gene can have several Ensembl transcripts, listing all the known splice variants. Play a bit with the Ensembl web site to see examples.

To get the HGNC symbol for an Ensembl gene ID, an easy way is to use biomaRt. Ask again if you are not familiar with it.

Simon

+---
| Dr. Simon Anders, Dipl. Phys.
| European Bioinformatics Institute (EMBL-EBI)
| Hinxton, Cambridgeshire, UK
| office phone +44-1223-492680, mobile phone +44-7505-841692
| preferred (permanent) e-mail: sanders@fs.tum.de
@steve-lianoglou-2771
Hi,

> My goal is to get the 3' UTR sequence associated with experimentally
> validated genes.
> Given the species "Human" and a miRNA identifier "hsa-miR-yyy", the
> TarBase interface returns a
> list of all genes (ENSGxxxxxx) that have been experimentally tested.
> I pass each such ENSGxxxxxx identifier to getSequence (a biomaRt function)
> to get the 3' UTR sequence.
> I was surprised to find multiple 3' UTR sequences associated with the
> same ENSGxxxxxx.
> Maybe each transcript is identified by a unique ENSTxxxx
> identifier... TRUE/FALSE?

That's likely the case, but you can easily verify this yourself. Just add "ensembl_transcript_id" (in addition to the ensembl_gene_id you already have) as one of the attributes returned by your getBM query, and see whether that explains the multiple 3_utr_start/end results you get.

-steve

--
Steve Lianoglou
Graduate Student: Physiology, Biophysics and Systems Biology
Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos
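The check suggested here could look roughly like the following. This is only a minimal sketch, assuming the biomaRt package is installed and the Ensembl BioMart service is reachable. The helper names `transcripts_for_gene` and `transcripts_per_gene` are made up for illustration, and the dataset name in the commented usage is an assumption about how `hmart` was created. Attribute names can differ between Ensembl releases, so it is worth checking them with `listAttributes(hmart)` first.

```r
# Hypothetical helper: return one row per transcript of an Ensembl gene.
# Assumes `mart` is a biomaRt Mart object for the human gene dataset.
transcripts_for_gene <- function(gene_id, mart) {
  biomaRt::getBM(
    attributes = c("ensembl_gene_id", "ensembl_transcript_id", "hgnc_symbol"),
    filters    = "ensembl_gene_id",
    values     = gene_id,
    mart       = mart
  )
}

# Hypothetical helper: count distinct transcripts per gene in such a table.
transcripts_per_gene <- function(gene_map) {
  pairs <- unique(gene_map[, c("ensembl_gene_id", "ensembl_transcript_id")])
  table(pairs$ensembl_gene_id)
}

# Usage (needs network access, so it is left commented out):
# library(biomaRt)
# hmart    <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")  # assumed setup
# gene.map <- transcripts_for_gene("ENSG00000206557", hmart)
# transcripts_per_gene(gene.map)
#
# Fetching the 3' UTR per transcript instead of per gene:
# utr <- getSequence(id = gene.map$ensembl_transcript_id,
#                    type = "ensembl_transcript_id",
#                    seqType = "3utr", mart = hmart)
```

If a gene shows a count above one, several 3' UTR rows for a single ENSG identifier are to be expected, since the sequences are stored per transcript.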
Simon Anders
@simon-anders-2626
Hi

mauede at alice.it wrote:
> I forgot to specify that I am only dealing with Human species.
> I used the ENSGxxxxx identifier to get out some data that I hoped would
> uniquely identify the gene.
>
> > gene.map <- getBM(attributes=c("hgnc_symbol","external_gene_id","refseq_dna"),
> >     filters="ensembl_gene_id", values="ENSG00000206557", mart=hmart)
> > show(gene.map)
>
> As long as all Human genes are uniquely identified through their
> respective "hgnc_symbol" I am fine.
>
> Why should I use the other identifier you mention ENSTxxxx ?

Well, I mentioned them because you talked about genes and transcripts as if the two were interchangeable. If you use Ensembl's Biomart, you will usually get one data record for each transcript, not for each gene.

Take, for example, the gene GLB1 (ENSG00000170266). It has three transcripts:
http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000170266;r=3:32621636-33121635;t=ENST00000307363

The first transcript (ENST00000307377) has a different 3' UTR from the second and third (ENST00000307363 and ENST00000399402).

As Steve wrote, you should add "ensembl_transcript_id" to your list of attributes to see what is going on.

Personally, I also find it very helpful to first try out any Biomart query on the web interface http://www.ensembl.org/biomart/martview before going to R. There, you can see quite easily what is going on.

Cheers
Simon
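The GLB1 example can also be written down as a small table to make the gene/transcript distinction concrete. This is only an illustrative sketch: the IDs are the ones quoted in the post and reflect the Ensembl release of that time, while the `utr_group` labels "A" and "B" are placeholders introduced here to record which transcripts are said to share a 3' UTR. They are not real annotation.

```r
# Transcripts of GLB1 as listed in the post (IDs taken from the thread).
glb1 <- data.frame(
  ensembl_gene_id       = "ENSG00000170266",
  ensembl_transcript_id = c("ENST00000307377",
                            "ENST00000307363",
                            "ENST00000399402"),
  utr_group             = c("A", "B", "B"),  # placeholder labels, not annotation
  stringsAsFactors      = FALSE
)

# One gene ID maps to three transcript IDs ...
n_transcripts <- length(unique(glb1$ensembl_transcript_id))

# ... and to two distinct 3' UTRs, so a per-gene query returns several rows.
n_utrs <- length(unique(glb1$utr_group))

# Keep the first transcript seen for each distinct 3' UTR.
representatives <- glb1[!duplicated(glb1$utr_group), "ensembl_transcript_id"]
```

Querying by ENSG identifier therefore cannot return a single 3' UTR for this gene; the ENST identifier is the level at which the sequence is unique.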