TCGA biomart conversion
@jarod_v6liberoit-6654
Italy

I want to convert Ensembl IDs to gene symbols using biomaRt.

My Ensembl IDs are written with the version suffix, like this:

ENSG00000066322.11

ENSG00000066336.10

ENSG00000066379.13

 

I have two problems: 1) Which is the right version of Ensembl to use for the genome version GRCh38.d1.vd1?

2) How can I extract the gene symbols from BioMart?

 

ensembl = useMart( host="dec2017.archive.ensembl.org",biomart="ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl" )

genemap <- getBM( attributes = c("ensembl_gene_id", "hgnc_symbol"),
                  filters = "ensembl_gene_id",
                  values = data$ensembl,
                  mart = ensembl )

 

 

> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=it_IT.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=it_IT.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=it_IT.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=it_IT.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] gplots_3.0.1               genefilter_1.60.0          limma_3.34.9              
 [4] biomaRt_2.34.2             reshape2_1.4.3             RColorBrewer_1.1-2        
 [7] ggplot2_2.2.1              pheatmap_1.0.10            DESeq2_1.18.1             
[10] SummarizedExperiment_1.8.1 DelayedArray_0.4.1         matrixStats_0.53.1        
[13] Biobase_2.38.0             GenomicRanges_1.30.3       GenomeInfoDb_1.14.0       
[16] IRanges_2.12.0             S4Vectors_0.16.0           BiocGenerics_0.24.0       
[19] GenomicDataCommons_1.5.3   magrittr_1.5              

loaded via a namespace (and not attached):
 [1] bitops_1.0-6           bit64_0.9-7            progress_1.1.2         httr_1.3.1            
 [5] tools_3.4.4            backports_1.1.2        utf8_1.1.4             R6_2.2.2              
 [9] KernSmooth_2.23-15     rpart_4.1-13           Hmisc_4.1-1            DBI_1.0.0             
[13] lazyeval_0.2.1         colorspace_1.3-2       nnet_7.3-12            tidyselect_0.2.4      
[17] gridExtra_2.3          prettyunits_1.0.2      bit_1.1-14             curl_3.2              
[21] compiler_3.4.4         cli_1.0.0              htmlTable_1.12         xml2_1.2.0            
[25] caTools_1.17.1         scales_0.5.0           checkmate_1.8.5        readr_1.1.1           
[29] rappdirs_0.3.1         stringr_1.3.1          digest_0.6.15          foreign_0.8-70        
[33] XVector_0.18.0         base64enc_0.1-3        pkgconfig_2.0.1        htmltools_0.3.6       
[37] htmlwidgets_1.2        rlang_0.2.1            rstudioapi_0.7         RSQLite_2.1.1         
[41] bindr_0.1.1            jsonlite_1.5           BiocParallel_1.12.0    gtools_3.5.0          
[45] acepack_1.4.1          dplyr_0.7.5            RCurl_1.95-4.10        GenomeInfoDbData_1.0.0
[49] Formula_1.2-3          Matrix_1.2-14          Rcpp_0.12.17           munsell_0.4.3         
[53] stringi_1.2.2          yaml_2.1.19            zlibbioc_1.24.0        plyr_1.8.4            
[57] grid_3.4.4             blob_1.1.1             gdata_2.18.0           crayon_1.3.4          
[61] lattice_0.20-35        splines_3.4.4          annotate_1.56.2        hms_0.4.2             
[65] locfit_1.5-9.1         knitr_1.20             pillar_1.2.3           geneplotter_1.56.0    
[69] XML_3.98-1.11          glue_1.2.0             latticeExtra_0.6-28    data.table_1.11.4     
[73] gtable_0.2.0           purrr_0.2.5            assertthat_0.2.0       xtable_1.8-2          
[77] survival_2.42-3        tibble_1.4.2           AnnotationDbi_1.40.0   memoise_1.1.0         
[81] bindrcpp_0.2.2         cluster_2.0.7-1      
Mike Smith
@mike-smith
EMBL Heidelberg

Here's an example of doing the conversion using biomaRt. You can use the versioned IDs you've got, but you'll see it's better to remove the version numbers.

First, we'll load biomaRt and use your example IDs.

library(biomaRt)
mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")

gene_ids_version <- c("ENSG00000236246.1",
                      "ENSG00000281088.1",
                      "ENSG00000254526.1",
                      "ENSG00000223575.2",
                      "ENSG00000201444.1",
                      "ENSG00000232573.1")

Now we can query BioMart, specifying that we want to use the versioned Ensembl Gene IDs by using the following:

getBM(attributes = c('ensembl_gene_id_version',
                     'hgnc_symbol'),
      filters = 'ensembl_gene_id_version', 
      values = gene_ids_version,
      mart = mart)

> 
  ensembl_gene_id_version hgnc_symbol
1       ENSG00000201444.1  RNU6-1082P
2       ENSG00000223575.2     RBMX2P3
3       ENSG00000232573.1      RPL3P4
4       ENSG00000254526.1            

However, notice that only 4 of our 6 IDs return a result. If you query with a versioned ID and that version isn't the most recent one, BioMart returns nothing for it, which is not really ideal. It's probably better to strip the version number and use just the Ensembl gene ID. We'll use the stringr package to do that here:

library(stringr)
# Escape the dot so it matches a literal "." rather than any character
gene_ids <- str_replace(gene_ids_version,
                        pattern = "\\.[0-9]+$",
                        replacement = "")
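If you'd rather not load stringr, the same trimming can be sketched in base R with sub(). This is only an illustration: it uses the IDs from the question, and the third ID has had its version removed on purpose (my assumption, not part of the original data) to show why the dot should be escaped. With an unescaped dot, an ID that has no version suffix gets truncated rather than left alone.

```r
# IDs from the question; the last one has its version removed on purpose
# (an illustrative assumption) to show how unversioned input behaves.
ids <- c("ENSG00000066322.11", "ENSG00000066336.10", "ENSG00000066379")

# "\\." matches a literal dot, so only a trailing ".<digits>" is removed.
stripped <- sub("\\.[0-9]+$", "", ids)

# With an unescaped dot, "." matches any character, so an ID without a
# version suffix is truncated instead of being left untouched.
unescaped <- sub(".[0-9]+$", "", ids)

print(stripped)
print(unescaped)
```

Keeping the original versioned IDs in a separate column next to the trimmed ones makes it easy to join the gene symbols back onto your data afterwards.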

Now rerun the query with the trimmed IDs and you'll get 5 results this time:

getBM(attributes = c('ensembl_gene_id',
                     'hgnc_symbol'),
      filters = 'ensembl_gene_id', 
      values = gene_ids,
      mart = mart)

>
  ensembl_gene_id hgnc_symbol
1 ENSG00000201444  RNU6-1082P
2 ENSG00000223575     RBMX2P3
3 ENSG00000232573      RPL3P4
4 ENSG00000236246            
5 ENSG00000254526            

The completely missing entry is because that gene, ENSG00000281088, has been retired from Ensembl, so you'll never get a result for it. The empty values for the other two are because there's no mapping between the Ensembl ID and an HGNC symbol for those genes.
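To see at a glance which IDs fall into each of those two groups, something like the following might help. It is a rough sketch that runs offline: bm_result is a hand-typed copy of the table shown above, standing in for whatever getBM() returns, and the names bm_result, absent_ids, unmapped_ids and annotated are only illustrative.

```r
# The six trimmed IDs used in the query above.
gene_ids <- c("ENSG00000236246", "ENSG00000281088", "ENSG00000254526",
              "ENSG00000223575", "ENSG00000201444", "ENSG00000232573")

# Stand-in for the getBM() result (copied by hand from the output above).
bm_result <- data.frame(
  ensembl_gene_id = c("ENSG00000201444", "ENSG00000223575", "ENSG00000232573",
                      "ENSG00000236246", "ENSG00000254526"),
  hgnc_symbol     = c("RNU6-1082P", "RBMX2P3", "RPL3P4", "", ""),
  stringsAsFactors = FALSE
)

# IDs that were queried but did not come back at all.
absent_ids <- setdiff(gene_ids, bm_result$ensembl_gene_id)

# IDs that came back with an empty HGNC symbol.
unmapped_ids <- bm_result$ensembl_gene_id[bm_result$hgnc_symbol == ""]

# Left join so every queried ID keeps a row; absent IDs get NA as symbol.
annotated <- merge(data.frame(ensembl_gene_id = gene_ids,
                              stringsAsFactors = FALSE),
                   bm_result, by = "ensembl_gene_id", all.x = TRUE)

print(absent_ids)
print(unmapped_ids)
print(annotated)
```

The left join (all.x = TRUE) keeps one row per queried ID, so the result lines up with your original data even when BioMart returns fewer rows than you asked for.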
