Question

gene ID conversion

0

Hamidreza Hashemi ▴ 20

Hi All,

I have a list of 500 genes in Ensembl ID format and I need to convert them to gene symbols. Here is an example: ENSG00000004468, gene symbol: CD38

I tried the following online tool but it failed to convert some of my genes, even though I can find them when I search the Ensembl database. Do you know if there is an up-to-date R package with human gene annotations? What package would you recommend for this purpose? I am afraid I would lose some of the genes in my list, since I have read that not all databases are updated with the most recent human gene annotations. Thank you so much for any information!

annotation • 8.5k views

ADD COMMENT • link updated 2.1 years ago by Kevin Blighe ★ 4.0k • written 4.9 years ago by Hamidreza Hashemi ▴ 20

score 3 · Answer 1 · 2020-04-27

Kevin Blighe ★ 4.0k

Hi, you can use biomaRt for this, although there are other solutions within Bioconductor itself.

Here, your 500 Ensembl gene IDs would be stored in my_genes, and we then create a lookup table that you can use for matching between Ensembl gene IDs and HGNC symbols.

library(biomaRt)

# connect to Ensembl's BioMart and select the human gene dataset
mart <- useMart('ENSEMBL_MART_ENSEMBL')
mart <- useDataset('hsapiens_gene_ensembl', mart)

# the Ensembl gene IDs to convert
my_genes <- c('ENSG00000004468','ENSG00000210049',
  'ENSG00000211459','ENSG00000210077',
  'ENSG00000210082','ENSG00000209082',
  'ENSG00000198888','ENSG00000257171',
  'ENSG00000227447','ENSG00000234089',
  'ENSG00000224565','ENSG00000205476','ENSG00000277040')

# build the lookup table, filtering on Ensembl gene ID
lookup <- getBM(
  mart = mart,
  attributes = c('entrezgene_id', 'ensembl_gene_id',
    'gene_biotype','hgnc_symbol'),
  filters = 'ensembl_gene_id',
  values = my_genes,
  uniqueRows = TRUE)

lookup
   entrezgene_id ensembl_gene_id                       gene_biotype hgnc_symbol
1            952 ENSG00000004468                     protein_coding        CD38
2           4535 ENSG00000198888                     protein_coding      MT-ND1
3         317762 ENSG00000205476                     protein_coding     CCDC85C
4             NA ENSG00000209082                            Mt_tRNA      MT-TL1
5             NA ENSG00000210049                            Mt_tRNA       MT-TF
6             NA ENSG00000210077                            Mt_tRNA       MT-TV
7             NA ENSG00000210082                            Mt_rRNA     MT-RNR2
8             NA ENSG00000211459                            Mt_rRNA     MT-RNR1
9             NA ENSG00000224565                             lncRNA   LINC01754
10            NA ENSG00000227447             unprocessed_pseudogene        XGY1
11            NA ENSG00000234089                             lncRNA            
12            NA ENSG00000257171 transcribed_unprocessed_pseudogene            
13            NA ENSG00000277040               processed_pseudogene

Please check the manual pages for the function getBM() in order to learn more.

Use listAttributes(mart) to see which other fields you can pull from Ensembl's servers, in terms of annotation.
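Once you have the lookup table, one way to attach symbols to your original list, in its original order, is match(). This is only a minimal sketch: the small data frame is typed in by hand from the output above so that it runs without querying Ensembl, the ID ENSG00000000000 is made up to show what happens with an ID that is not found, and falling back to the Ensembl ID when there is no symbol is an assumption on my part, not something biomaRt does for you.

```r
# Hand-typed subset of the lookup table shown above (no server query needed)
lookup <- data.frame(
  ensembl_gene_id = c('ENSG00000004468', 'ENSG00000198888', 'ENSG00000234089'),
  hgnc_symbol = c('CD38', 'MT-ND1', ''),
  stringsAsFactors = FALSE)

# IDs in their original order; the last one is made up and absent from lookup
my_genes <- c('ENSG00000234089', 'ENSG00000004468', 'ENSG00000000000')

# match() gives the row of lookup for each ID, or NA when the ID is missing
idx <- match(my_genes, lookup$ensembl_gene_id)
symbols <- lookup$hgnc_symbol[idx]

# Assumption: keep the Ensembl ID wherever no HGNC symbol is available
no_symbol <- is.na(symbols) | symbols == ''
symbols[no_symbol] <- my_genes[no_symbol]

result <- data.frame(ensembl_gene_id = my_genes, symbol = symbols,
  stringsAsFactors = FALSE)
result
```

Note that match() only uses the first matching row, so if an Ensembl ID appears in several rows of the lookup table, the later rows are ignored here.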

Kevin

ADD COMMENT • link 4.9 years ago Kevin Blighe ★ 4.0k

0

Thank you so much Kevin. I have about 500 Ensembl IDs. How can I create a vector from them that can be read by biomaRt? I am new to R and would appreciate it if you could help me with this as well.

ADD REPLY • link 4.9 years ago Hamidreza Hashemi ▴ 20

0

Hey Hamidreza. Sure thing. Where are the IDs currently stored? In a file on your disk?

ADD REPLY • link 4.9 years ago Kevin Blighe ★ 4.0k

0

Hi Kevin, yes, I have them as Ensembl IDs in a column of an Excel sheet. Your comments have been helping me a lot. :)

ADD REPLY • link 4.9 years ago Hamidreza Hashemi ▴ 20

0

Okay, you can try to import the data directly into R via certain packages that can interpret Excel files, such as readxl. There is some useful information here about it: http://www.sthda.com/english/wiki/reading-data-from-excel-files-xls-xlsx-into-r
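For the readxl route, something along these lines might work. It is a sketch under assumptions: read_id_column() is a hypothetical helper written for this reply, and the file name my_genes.xlsx and the column name ensembl_gene_id are placeholders that you would replace with your own.

```r
library(readxl)

# Hypothetical helper: read one column of an Excel sheet as a character vector
read_id_column <- function(path, column, sheet = 1) {
  tab <- read_excel(path, sheet = sheet)
  ids <- as.character(tab[[column]])
  ids[!is.na(ids)]  # drop empty cells
}

# Assumed file and column names; change these to match your own sheet
if (file.exists('my_genes.xlsx')) {
  my_genes <- read_id_column('my_genes.xlsx', 'ensembl_gene_id')
  head(my_genes)
}
```

The resulting character vector can then be passed as the values argument of getBM().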

Alternatively, with the Excel sheet open, you can copy the Ensembl IDs and paste them into an empty text file, called test.txt:

test.txt

ENS0001
ENS0002
ENS0003

Then, within R:

ens <- read.table('test.txt', header = FALSE, stringsAsFactors = FALSE)[,1]
ens
[1] "ENS0001" "ENS0002" "ENS0003"
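Before passing the vector to getBM(), it can help to tidy it up, because pasted IDs sometimes carry stray spaces, blank lines, duplicates, or a version suffix (for example ENSG00000198888.2), and versioned IDs will generally not match the ensembl_gene_id filter. A minimal sketch, using made-up input values to stand in for what read.table() returned:

```r
# Example input as it might come out of a pasted text file (made-up values)
ens <- c(' ENSG00000004468', 'ENSG00000198888.2', 'ENSG00000004468', '')

ens <- trimws(ens)                 # remove stray spaces from copy and paste
ens <- sub('\\.[0-9]+$', '', ens)  # drop a version suffix such as '.2'
ens <- unique(ens[ens != ''])      # drop blank entries and duplicates
ens
```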

ADD REPLY • link 4.9 years ago Kevin Blighe ★ 4.0k

1

Thanks Kevin. It worked well.

ADD REPLY • link 4.9 years ago Hamidreza Hashemi ▴ 20

0

Hi Kevin. This step helped me visualise my gene IDs. However, I have 55,980 gene IDs. How do I load them into a list so that I can follow the code you've written above to obtain gene symbols? I am performing RNA-Seq analysis and have the gene IDs from feature counts. Thanks so much!

ADD REPLY • link 2.1 years ago Niharika • 0

0

Hi, in which format is your data? A TSV file? Can you show an example? Here is another answer, by the way: https://www.biostars.org/p/9461782/#9461790

ADD REPLY • link 2.1 years ago Kevin Blighe ★ 4.0k