Question

Getting HGNC gene names from Ensembl transcript IDs (e.g., ENST0000...)

kmuench ▴ 40

@kmuench-9243

United States

Hello,

In R, I previously used this piece of code to look up Entrez IDs and HGNC symbols for lists of Ensembl gene IDs beginning with ENSG000... . In this example, my_df is a data frame whose row names are the gene IDs (e.g. ENSG...):

  library(biomaRt)

  # Keep only the part of each row name before any "+" (the Ensembl gene ID)
  my_df$ensembl <- sapply( strsplit( rownames(my_df), split="\\+" ), "[", 1 )
  ensembl = useMart("ENSEMBL_MART_ENSEMBL",dataset="hsapiens_gene_ensembl", host="www.ensembl.org") # reflects recent change to hosting, as discussed in https://support.bioconductor.org/p/74322/
  genemap <- getBM( attributes = c("ensembl_gene_id", "entrezgene", "hgnc_symbol"),
                    filters = "ensembl_gene_id",
                    values = my_df$ensembl,
                    mart = ensembl )
  # Align the BioMart results with the rows of my_df
  idx <- match( my_df$ensembl, genemap$ensembl_gene_id )
  my_df$entrez <- genemap$entrezgene[ idx ]
  my_df$hgnc_symbol <- genemap$hgnc_symbol[ idx ]

I'd now like to use this on a dataframe where the input row names are transcript IDs (e.g. ENST000...). I'm not sure whether I can do this with BioMart - does anyone know?

rnaseq r biomart • 8.9k views

ADD COMMENT • link updated 8.4 years ago by Johannes Rainer ★ 2.0k • written 8.4 years ago by kmuench ▴ 40

The Ensembl mart also provides transcript IDs (via the ensembl_transcript_id attribute) so I don't see why you couldn't do the same with transcript IDs instead of gene IDs. Use listAttributes() to list all the attributes available for your dataset.

H.

ADD REPLY • link 8.4 years ago Hervé Pagès 16k
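
A minimal sketch of what the transcript-level version of the query from the question might look like, assuming the row names of my_df are Ensembl transcript IDs. The helper names strip_version() and map_tx_to_hgnc() are made up for illustration. ensembl_transcript_id is used both as an attribute and as the filter; that filter expects unversioned IDs, so a trailing version suffix such as ".4" is removed first (whether your IDs carry one is an assumption).

```r
# Hypothetical helper: drop a trailing version suffix (e.g. ".4") if present.
strip_version <- function(ids) sub("\\.[0-9]+$", "", ids)

# Hypothetical helper: look up gene ID and HGNC symbol for transcript IDs.
# Needs the biomaRt package and network access at the time it is called.
map_tx_to_hgnc <- function(tx_ids, mart) {
  tx_ids <- strip_version(tx_ids)
  txmap <- biomaRt::getBM(
    attributes = c("ensembl_transcript_id", "ensembl_gene_id", "hgnc_symbol"),
    filters    = "ensembl_transcript_id",
    values     = unique(tx_ids),
    mart       = mart
  )
  # match() keeps the result aligned with the input order;
  # transcript IDs that BioMart did not return give NA rows.
  txmap[match(tx_ids, txmap$ensembl_transcript_id), , drop = FALSE]
}

# Usage (not run here, requires a connection to Ensembl):
# ensembl <- biomaRt::useMart("ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl")
# res <- map_tx_to_hgnc(rownames(my_df), ensembl)
# my_df$hgnc_symbol <- res$hgnc_symbol
```

The match() step is the same idiom as in the question, so the new column lines up with the rows of my_df even when some transcripts are missing from the result.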

score 1 · Answer 1 · 2015-11-20

Johannes Rainer ★ 2.0k

@johannes-rainer-6987

Italy

You could also use the ensembldb package to get the mapping between transcript IDs and gene names (HGNC):

library(EnsDb.Hsapiens.v75)
edb <- EnsDb.Hsapiens.v75

## Get all transcripts defined in Ensembl (version 75):
tx <- transcripts(edb, columns=c("tx_id", "gene_id", "gene_name"))

## You can then extract the transcript IDs and gene names, or even
## build a lookup matrix keyed by transcript ID:
mapping <- cbind(tx_id=tx$tx_id, name=tx$gene_name)
rownames(mapping) <- mapping[, 1]

> head(mapping)
                tx_id             name    
ENST00000373020 "ENST00000373020" "TSPAN6"
ENST00000494424 "ENST00000494424" "TSPAN6"
ENST00000496771 "ENST00000496771" "TSPAN6"
ENST00000373031 "ENST00000373031" "TNMD"  
ENST00000485971 "ENST00000485971" "TNMD"  
ENST00000371582 "ENST00000371582" "DPM1"

hope that helps

cheers, jo

ADD COMMENT • link 8.4 years ago Johannes Rainer ★ 2.0k
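
To attach the gene names to a data frame like the one in the question, the lookup matrix can be matched against the row names. The sketch below uses a toy stand-in for `mapping` with three rows copied from the head() output above, so it runs without ensembldb installed; the contents of my_df, including the unmatched ID, are invented for illustration.

```r
# Toy stand-in for the 'mapping' matrix; values copied from the head() output above.
mapping <- cbind(
  tx_id = c("ENST00000373020", "ENST00000373031", "ENST00000371582"),
  name  = c("TSPAN6", "TNMD", "DPM1")
)
rownames(mapping) <- mapping[, "tx_id"]

# Hypothetical data frame with transcript IDs as row names;
# "ENST99999999999" is a made-up ID to show the unmatched case.
my_df <- data.frame(
  count = c(5, 12, 7),
  row.names = c("ENST00000371582", "ENST99999999999", "ENST00000373020")
)

# Align by transcript ID; IDs missing from the mapping become NA.
idx <- match(rownames(my_df), rownames(mapping))
my_df$gene_name <- unname(mapping[idx, "name"])
```

With the real matrix built from EnsDb.Hsapiens.v75, only the last two lines would be needed.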

Thank you for the suggestion! I'm having trouble with the install (getting a "there is no package called ‘EnsDb.Hsapiens.v75’" message) but this looks like what I want - I'll keep trying.

ADD REPLY • link 8.4 years ago kmuench ▴ 40

For posterity: my issue is specifically that the library doesn't seem to be available yet for my version of R (error: "package ‘ensembldb’ is not available (for R version 3.1.2)"). Currently working through these possibilities: http://stackoverflow.com/questions/25721884/how-should-i-deal-with-package-xxx-is-not-available-for-r-version-x-y-z-wa

ADD REPLY • link 8.4 years ago kmuench ▴ 40

Check out the ensembldb landing page. In the gray-and-white striped 'Details' section it says the package has been in Bioconductor since BioC 3.1 (R-3.2). To use the package, simply install the current version of R (R-3.2.2) and follow the usual source() / biocLite() instructions on the landing page. The package will never be made available for a version of R older than the one it was introduced in, so the 'yet' in your comment is too optimistic!
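
As a rough sketch of the version check described here: the helper meets_min_r() is hypothetical, and the minimum of "3.2.0" is taken from the "BioC 3.1 (R-3.2)" note above. The install commands are left as comments because they need a network connection; note that on current Bioconductor releases biocLite() has been replaced by BiocManager::install().

```r
# Hypothetical helper: does an R version meet the minimum a package requires?
meets_min_r <- function(current, minimum = "3.2.0") {
  package_version(current) >= package_version(minimum)
}

meets_min_r("3.1.2")  # FALSE: the R version from the error message in this thread
meets_min_r("3.2.2")  # TRUE

# To check the running session instead:
# getRversion() >= "3.2.0"

# Once R itself is new enough, installation at the time of this thread was:
# source("https://bioconductor.org/biocLite.R")
# biocLite(c("ensembldb", "EnsDb.Hsapiens.v75"))

# On current Bioconductor releases:
# install.packages("BiocManager")
# BiocManager::install(c("ensembldb", "EnsDb.Hsapiens.v75"))
```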