Question: Getting HGNC gene names from Ensembl transcript IDs (e.g., ENST0000...)
kmuench (United States) wrote, 2.1 years ago:

Hello,

In R, I previously used this piece of code to look up Entrez IDs and HGNC symbols for lists of Ensembl gene IDs (beginning with ENSG000...). In this example, my_df is a data frame whose row names are the gene IDs (e.g. ENSG...):

 

  library(biomaRt)
  # keep only the part of each row name before the first "+"
  my_df$ensembl <- sapply( strsplit( rownames(my_df), split="\\+" ), "[", 1 )
  # host reflects recent change to hosting, as discussed in https://support.bioconductor.org/p/74322/
  ensembl = useMart("ENSEMBL_MART_ENSEMBL",dataset="hsapiens_gene_ensembl", host="www.ensembl.org")
  # note: newer Ensembl releases call the "entrezgene" attribute "entrezgene_id"; check listAttributes(ensembl)
  genemap <- getBM( attributes = c("ensembl_gene_id", "entrezgene", "hgnc_symbol"),
                    filters = "ensembl_gene_id",
                    values = my_df$ensembl,
                    mart = ensembl )
  # align the BioMart result with the rows of my_df (NA where there is no match)
  idx <- match( my_df$ensembl, genemap$ensembl_gene_id )
  my_df$entrez <- genemap$entrezgene[ idx ]
  my_df$hgnc_symbol <- genemap$hgnc_symbol[ idx ]

 

I'd now like to use this on a dataframe where the input row names are transcript IDs (e.g. ENST000...). I'm not sure whether I can do this with BioMart - does anyone know?


The Ensembl mart also provides transcript IDs (via the ensembl_transcript_id attribute) so I don't see why you couldn't do the same with transcript IDs instead of gene IDs. Use listAttributes() to list all the attributes available for your dataset.
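As a rough sketch of what that could look like, here is the code from the question with ensembl_transcript_id swapped in as both attribute and filter. It assumes my_df has unversioned transcript IDs (ENST...) as row names; the column name ensembl_tx and the object name txmap are made up for illustration, and the host setting is simply copied from the question, so it may need adjusting for your biomaRt version.

```r
library(biomaRt)

# Assumption: my_df is the data frame from the question, but with Ensembl
# transcript IDs (ENST...) as row names. "ensembl_tx" is a hypothetical column name.
my_df$ensembl_tx <- sapply(strsplit(rownames(my_df), split = "\\+"), "[", 1)

# Same mart and dataset as in the question
ensembl <- useMart("ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl",
                   host = "www.ensembl.org")

# Query by transcript ID instead of gene ID
txmap <- getBM(attributes = c("ensembl_transcript_id", "ensembl_gene_id", "hgnc_symbol"),
               filters = "ensembl_transcript_id",
               values = my_df$ensembl_tx,
               mart = ensembl)

# Align the result with the rows of my_df (NA where a transcript was not found)
idx <- match(my_df$ensembl_tx, txmap$ensembl_transcript_id)
my_df$ensembl_gene_id <- txmap$ensembl_gene_id[idx]
my_df$hgnc_symbol <- txmap$hgnc_symbol[idx]
```

If your IDs carry a version suffix (e.g. ENST00000373020.4), they will probably not match this filter as-is; stripping the suffix first with sub("\\.[0-9]+$", "", ids) is one option.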

H.

Reply written 2.1 years ago by Hervé Pagès
Johannes Rainer (Italy) wrote, 2.1 years ago:

You could also use the ensembldb package to get the mapping between transcript IDs and gene names (HGNC):

library(EnsDb.Hsapiens.v75)
edb <- EnsDb.Hsapiens.v75

## Get all transcripts defined in Ensembl (version 75):
tx <- transcripts(edb, columns=c("tx_id", "gene_id", "gene_name"))

## you can then extract the transcript ids and gene names or even
mapping <- cbind(tx_id=tx$tx_id, name=tx$gene_name)
rownames(mapping) <- mapping[, 1]

> head(mapping)
                tx_id             name    
ENST00000373020 "ENST00000373020" "TSPAN6"
ENST00000494424 "ENST00000494424" "TSPAN6"
ENST00000496771 "ENST00000496771" "TSPAN6"
ENST00000373031 "ENST00000373031" "TNMD"  
ENST00000485971 "ENST00000485971" "TNMD"  
ENST00000371582 "ENST00000371582" "DPM1"  
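
To apply such a mapping to a data frame keyed by transcript IDs, the same match() idiom as in the question should work. The following is only a base-R sketch: the small mapping is a hand-built stand-in using three of the rows shown above, and my_df with its count column is hypothetical.

```r
# Stand-in for the mapping matrix built above (three of the rows shown);
# with ensembldb installed you would use the real `mapping` instead.
mapping <- cbind(tx_id = c("ENST00000373020", "ENST00000373031", "ENST00000371582"),
                 name  = c("TSPAN6", "TNMD", "DPM1"))
rownames(mapping) <- mapping[, "tx_id"]

# Hypothetical data frame with transcript IDs as row names;
# the last ID is deliberately not in the mapping.
my_df <- data.frame(count = c(10, 20, 30),
                    row.names = c("ENST00000371582", "ENST00000373020", "ENST99999999999"))

# match() gives NA for IDs missing from the mapping, so those rows get NA
idx <- match(rownames(my_df), mapping[, "tx_id"])
my_df$hgnc_symbol <- unname(mapping[idx, "name"])
```

Transcripts that are not present in the EnsDb release you loaded simply end up with NA in hgnc_symbol, which makes release mismatches easy to spot.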

 

hope that helps

cheers, jo


Thank you for the suggestion! I'm having trouble with the install (getting a "there is no package called ‘EnsDb.Hsapiens.v75’" message) but this looks like what I want - I'll keep trying.

Reply written 2.1 years ago by kmuench

For posterity: my issue is specifically that the library doesn't seem to be available yet for my version of R (error: "package ‘ensembldb’ is not available (for R version 3.1.2)"). Currently working through these possibilities: http://stackoverflow.com/questions/25721884/how-should-i-deal-with-package-xxx-is-not-available-for-r-version-x-y-z-wa

Reply written 2.1 years ago by kmuench

Check out the ensembldb landing page. In the gray-and-white striped 'Details' section it says the package has been in Bioconductor since BioC 3.1 (R-3.2). To use the package, simply install the current version of R (R-3.2.2) and follow the usual source() / biocLite() instructions on the landing page. A package is never made available for a version of R older than the one it was introduced in, so the 'yet' in your comment is too optimistic!
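
For reference, the source() / biocLite() route looked roughly like the sketch below at the time. The package names are the ones from this thread; the commented-out BiocManager lines are the replacement on current R releases, where biocLite() is no longer used.

```r
# Period-appropriate install (R 3.2.x), as referred to above:
source("https://bioconductor.org/biocLite.R")
biocLite(c("ensembldb", "EnsDb.Hsapiens.v75"))

# On current R, BiocManager has replaced biocLite():
# install.packages("BiocManager")
# BiocManager::install(c("ensembldb", "EnsDb.Hsapiens.v75"))
```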

Reply written 2.1 years ago by Martin Morgan

Haha! Thank you for pointing that out!

Reply written 2.1 years ago by kmuench