Question: TCGA biomart conversion
Asked 14 months ago by jarod_v6@libero.it (Italy):

I want to convert Ensembl IDs to gene symbols using biomaRt.

My Ensembl IDs are written with the version number, like this:

ENSG00000066322.11

ENSG00000066336.10

ENSG00000066379.13

 

I have two problems:

1) Which is the right version of Ensembl to use for the genome version GRCh38.d1.vd1?

2) How can I extract the gene symbols from biomaRt? This is what I have tried:

 

ensembl <- useMart( host = "dec2017.archive.ensembl.org",
                    biomart = "ENSEMBL_MART_ENSEMBL",
                    dataset = "hsapiens_gene_ensembl" )

genemap <- getBM( attributes = c("ensembl_gene_id", "hgnc_symbol"),
                  filters = "ensembl_gene_id",
                  values = data$ensembl,
                  mart = ensembl )

 

 

> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=it_IT.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=it_IT.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=it_IT.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=it_IT.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] gplots_3.0.1               genefilter_1.60.0          limma_3.34.9              
 [4] biomaRt_2.34.2             reshape2_1.4.3             RColorBrewer_1.1-2        
 [7] ggplot2_2.2.1              pheatmap_1.0.10            DESeq2_1.18.1             
[10] SummarizedExperiment_1.8.1 DelayedArray_0.4.1         matrixStats_0.53.1        
[13] Biobase_2.38.0             GenomicRanges_1.30.3       GenomeInfoDb_1.14.0       
[16] IRanges_2.12.0             S4Vectors_0.16.0           BiocGenerics_0.24.0       
[19] GenomicDataCommons_1.5.3   magrittr_1.5              

loaded via a namespace (and not attached):
 [1] bitops_1.0-6           bit64_0.9-7            progress_1.1.2         httr_1.3.1            
 [5] tools_3.4.4            backports_1.1.2        utf8_1.1.4             R6_2.2.2              
 [9] KernSmooth_2.23-15     rpart_4.1-13           Hmisc_4.1-1            DBI_1.0.0             
[13] lazyeval_0.2.1         colorspace_1.3-2       nnet_7.3-12            tidyselect_0.2.4      
[17] gridExtra_2.3          prettyunits_1.0.2      bit_1.1-14             curl_3.2              
[21] compiler_3.4.4         cli_1.0.0              htmlTable_1.12         xml2_1.2.0            
[25] caTools_1.17.1         scales_0.5.0           checkmate_1.8.5        readr_1.1.1           
[29] rappdirs_0.3.1         stringr_1.3.1          digest_0.6.15          foreign_0.8-70        
[33] XVector_0.18.0         base64enc_0.1-3        pkgconfig_2.0.1        htmltools_0.3.6       
[37] htmlwidgets_1.2        rlang_0.2.1            rstudioapi_0.7         RSQLite_2.1.1         
[41] bindr_0.1.1            jsonlite_1.5           BiocParallel_1.12.0    gtools_3.5.0          
[45] acepack_1.4.1          dplyr_0.7.5            RCurl_1.95-4.10        GenomeInfoDbData_1.0.0
[49] Formula_1.2-3          Matrix_1.2-14          Rcpp_0.12.17           munsell_0.4.3         
[53] stringi_1.2.2          yaml_2.1.19            zlibbioc_1.24.0        plyr_1.8.4            
[57] grid_3.4.4             blob_1.1.1             gdata_2.18.0           crayon_1.3.4          
[61] lattice_0.20-35        splines_3.4.4          annotate_1.56.2        hms_0.4.2             
[65] locfit_1.5-9.1         knitr_1.20             pillar_1.2.3           geneplotter_1.56.0    
[69] XML_3.98-1.11          glue_1.2.0             latticeExtra_0.6-28    data.table_1.11.4     
[73] gtable_0.2.0           purrr_0.2.5            assertthat_0.2.0       xtable_1.8-2          
[77] survival_2.42-3        tibble_1.4.2           AnnotationDbi_1.40.0   memoise_1.1.0         
[81] bindrcpp_0.2.2         cluster_2.0.7-1      
Answer: TCGA biomart conversion
Answered 14 months ago by Mike Smith (EMBL Heidelberg / de.NBI):

Here's an example of doing the conversion using biomaRt. You can use the versioned IDs you've got, but you'll see it's better to remove the version numbers.

First, we'll load biomaRt and use your example IDs.

library(biomaRt)
mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")

gene_ids_version <- c("ENSG00000236246.1",
                      "ENSG00000281088.1",
                      "ENSG00000254526.1",
                      "ENSG00000223575.2",
                      "ENSG00000201444.1",
                      "ENSG00000232573.1")

Now we can query BioMart, specifying that we want to use the versioned Ensembl Gene IDs by using the following:

getBM(attributes = c('ensembl_gene_id_version',
                     'hgnc_symbol'),
      filters = 'ensembl_gene_id_version', 
      values = gene_ids_version,
      mart = mart)

> 
  ensembl_gene_id_version hgnc_symbol
1       ENSG00000201444.1  RNU6-1082P
2       ENSG00000223575.2     RBMX2P3
3       ENSG00000232573.1      RPL3P4
4       ENSG00000254526.1

However, notice that we only get 4 results returned from our 6 IDs. This is because a query that uses a version number returns nothing for that ID unless the version is the most recent one, which is not really ideal. It's probably better to strip the version number and use just the Ensembl gene ID. We'll use the stringr package to do that here:

library(stringr)
# The dot is escaped so that it matches a literal "." rather than any character
gene_ids <- str_replace(gene_ids_version,
                        pattern = "\\.[0-9]+$",
                        replacement = "")
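
If you'd rather not load stringr, the same trimming can be sketched in base R with sub(). This is only an illustration: the variable names tcga_ids_version and tcga_ids are made up here, and the IDs are the three examples from the question.

```r
# Versioned Ensembl gene IDs copied from the question (illustrative input)
tcga_ids_version <- c("ENSG00000066322.11",
                      "ENSG00000066336.10",
                      "ENSG00000066379.13")

# Remove a trailing ".<digits>" suffix; "\\." matches a literal dot
tcga_ids <- sub("\\.[0-9]+$", "", tcga_ids_version)

print(tcga_ids)
```

IDs that carry no version suffix pass through unchanged, so the call should be safe to apply to a mixed vector.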

Now rerun the query with the trimmed IDs and you'll get 5 results this time:

getBM(attributes = c('ensembl_gene_id',
                     'hgnc_symbol'),
      filters = 'ensembl_gene_id', 
      values = gene_ids,
      mart = mart)

>
  ensembl_gene_id hgnc_symbol
1 ENSG00000201444  RNU6-1082P
2 ENSG00000223575     RBMX2P3
3 ENSG00000232573      RPL3P4
4 ENSG00000236246            
5 ENSG00000254526

The completely missing entry is because that gene, ENSG00000281088, has been retired from Ensembl, so you'll never get a result. The empty hgnc_symbol values for ENSG00000236246 and ENSG00000254526 are because there's no mapping between Ensembl ID and HGNC name for those genes.
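
With a longer list of IDs you probably won't want to spot these cases by eye. Here is a minimal sketch of one way to separate "not returned at all" from "returned without a symbol". To keep it runnable offline, the data frame result is typed in by hand from the output above rather than fetched with getBM(), and the names result, not_found, no_symbol and symbols are just illustrative.

```r
# Hand-typed stand-in for the getBM() result shown above
result <- data.frame(
  ensembl_gene_id = c("ENSG00000201444", "ENSG00000223575", "ENSG00000232573",
                      "ENSG00000236246", "ENSG00000254526"),
  hgnc_symbol     = c("RNU6-1082P", "RBMX2P3", "RPL3P4", "", ""),
  stringsAsFactors = FALSE
)

# The trimmed IDs that were sent to BioMart, in their original order
gene_ids <- c("ENSG00000236246", "ENSG00000281088", "ENSG00000254526",
              "ENSG00000223575", "ENSG00000201444", "ENSG00000232573")

# IDs with no row in the result (e.g. retired from Ensembl)
not_found <- setdiff(gene_ids, result$ensembl_gene_id)

# IDs that were found but have no HGNC symbol
no_symbol <- result$ensembl_gene_id[result$hgnc_symbol == ""]

# Symbols in the input order; missing or empty entries become NA
symbols <- result$hgnc_symbol[match(gene_ids, result$ensembl_gene_id)]
symbols[!is.na(symbols) & symbols == ""] <- NA

print(not_found)
print(no_symbol)
print(symbols)
```

Using match() keeps the symbols aligned with your input, which matters because getBM() does not return rows in the order you supplied the IDs (compare the input and output above).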
