Hi, you can use biomaRt for this, although there are other solutions within Bioconductor itself.
Here, your 500 Ensembl gene IDs would be stored in my_genes, and we then create a lookup table that you can use for matching between Ensembl gene IDs and HGNC symbols.
library(biomaRt)

# Connect to the Ensembl BioMart and select the human gene dataset
mart <- useMart('ENSEMBL_MART_ENSEMBL')
mart <- useDataset('hsapiens_gene_ensembl', mart)

# Ensembl gene IDs to annotate
my_genes <- c('ENSG00000004468', 'ENSG00000210049',
  'ENSG00000211459', 'ENSG00000210077',
  'ENSG00000210082', 'ENSG00000209082',
  'ENSG00000198888', 'ENSG00000257171',
  'ENSG00000227447', 'ENSG00000234089',
  'ENSG00000224565', 'ENSG00000205476', 'ENSG00000277040')

# Build the lookup table: one row per unique combination of attributes
lookup <- getBM(
  mart = mart,
  attributes = c('entrezgene_id', 'ensembl_gene_id',
    'gene_biotype', 'hgnc_symbol'),
  filters = 'ensembl_gene_id',
  values = my_genes,
  uniqueRows = TRUE)

lookup
   entrezgene_id ensembl_gene_id                       gene_biotype hgnc_symbol
 1           952 ENSG00000004468                     protein_coding        CD38
 2          4535 ENSG00000198888                     protein_coding      MT-ND1
 3        317762 ENSG00000205476                     protein_coding     CCDC85C
 4            NA ENSG00000209082                            Mt_tRNA      MT-TL1
 5            NA ENSG00000210049                            Mt_tRNA       MT-TF
 6            NA ENSG00000210077                            Mt_tRNA       MT-TV
 7            NA ENSG00000210082                            Mt_rRNA     MT-RNR2
 8            NA ENSG00000211459                            Mt_rRNA     MT-RNR1
 9            NA ENSG00000224565                             lncRNA   LINC01754
10            NA ENSG00000227447             unprocessed_pseudogene        XGY1
11            NA ENSG00000234089                             lncRNA
12            NA ENSG00000257171 transcribed_unprocessed_pseudogene
13            NA ENSG00000277040               processed_pseudogene
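For the matching step itself, here is a minimal sketch using base R's match(). The small lookup data frame and the my_ids vector are made-up stand-ins for illustration (so that the snippet runs without querying Ensembl); in practice you would use the lookup table returned by getBM() and your own IDs.

```r
# Stand-in for the lookup table returned by getBM() (illustrative subset)
lookup <- data.frame(
  ensembl_gene_id = c('ENSG00000004468', 'ENSG00000198888', 'ENSG00000234089'),
  hgnc_symbol     = c('CD38', 'MT-ND1', ''),
  stringsAsFactors = FALSE)

# Hypothetical IDs in the order they appear in your own data;
# the last one is deliberately absent from the lookup table
my_ids <- c('ENSG00000198888', 'ENSG00000004468',
  'ENSG00000234089', 'ENSG00000000000')

# match() returns, for each ID, its row in the lookup table (NA if absent)
idx <- match(my_ids, lookup$ensembl_gene_id)
symbols <- lookup$hgnc_symbol[idx]

# Fall back to the Ensembl ID where no HGNC symbol is available
no_symbol <- is.na(symbols) | symbols == ''
symbols[no_symbol] <- my_ids[no_symbol]

symbols
```

Falling back to the Ensembl ID is only one possible choice; you could equally keep the NA values and filter those genes out later.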
Please check the manual pages for the function getBM() in order to learn more.
Use listAttributes(mart) to see which other annotation fields you can pull from Ensembl's servers.
Kevin
Thank you so much, Kevin. I have about 500 Ensembl IDs. How can I create a vector from them that can be read by biomaRt? I am new to R and would appreciate it if you could help me with this as well.
Hey Hamidreza. Sure thing. Where are the IDs stored, currently? - a file on your disk?
Hi Kevin, yes, I have them as Ensembl IDs in a column of an Excel sheet. Your comments have been helping me a lot. :)
Okay, you can try to import the data directly into R via certain packages that can interpret Excel files, such as readxl. There is some useful information here about it: http://www.sthda.com/english/wiki/reading-data-from-excel-files-xls-xlsx-into-r
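As a rough sketch of the readxl route: the file name my_ids.xlsx and the column name ensembl_gene_id below are assumptions for illustration only, so change them to match your own sheet. The read only happens if the file exists.

```r
library(readxl)

# Hypothetical file and column names: adjust these to your own sheet
excel_file <- 'my_ids.xlsx'
id_column  <- 'ensembl_gene_id'

if (file.exists(excel_file)) {
  # read_excel() reads the first sheet by default
  sheet <- read_excel(excel_file)

  # Pull the ID column out as a plain character vector, without NAs or repeats
  my_genes <- as.character(sheet[[id_column]])
  my_genes <- unique(my_genes[!is.na(my_genes)])

  head(my_genes)
}
```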
Alternatively, with the Excel sheet open, you can copy the Ensembl IDs and paste them into an empty text file, called test.txt:
test.txt
Then, within R:
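A minimal sketch of this step, assuming the text file holds one Ensembl ID per line. The helper name read_gene_ids is made up for this example, and the demonstration uses a temporary stand-in file so that it runs anywhere; in practice you would pass 'test.txt' instead.

```r
# Read one Ensembl gene ID per line, trimming whitespace and
# dropping blank lines and duplicates
read_gene_ids <- function(path) {
  ids <- trimws(readLines(path, warn = FALSE))
  unique(ids[nzchar(ids)])
}

# Temporary stand-in for test.txt, including a blank line and a repeat
demo_file <- tempfile(fileext = '.txt')
writeLines(c('ENSG00000004468', 'ENSG00000210049 ', '', 'ENSG00000004468'),
  demo_file)

my_genes <- read_gene_ids(demo_file)
my_genes
```

The resulting my_genes vector can then be passed as the values argument of getBM(), as in the code further up.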
Thanks Kevin. It worked well.
Hi Kevin. This step helped me visualise my gene IDs. However, I have 55,980 gene IDs. How do I load them into a list so that I can follow the code you've written above to obtain gene symbols? I am performing RNA-Seq analysis and have the gene IDs from featureCounts. Thanks so much!
Hi, in which format is your data? - TSV file? Can you show an example? Here is another answer, by the way: https://www.biostars.org/p/9461782/#9461790