ENSEMBL to gene symbol
Matt • 0
@b0a618e7

Hello, I am having a problem matching my Ensembl IDs to the corresponding gene symbols. When I use the filter "ensembl_gene_id_version", I get only 11 gene symbols back. When I use no filter, I get around 50,000 gene symbols, which is confusing considering I have only 15,000 Ensembl IDs. This creates a problem when I try to merge the counts data frame with the gene symbol output from getBM. I have tried different filters, and I am using the most up-to-date version of the dataset.

mart <- useMart(biomart = 'ensembl', dataset = 'mmusculus_gene_ensembl', host = 'useast.ensembl.org')
all_coding_genes <- getBM(attributes = c("mgi_symbol"),
                          values = row.names(res_ordered),
                          filters = "ensembl_gene_id_version",
                          mart = mart,
                          uniqueRows = TRUE)


```
> sessionInfo()
R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22000)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8

attached base packages:
[1] tools     stats4    stats     graphics  grDevices
[6] utils     datasets  methods   base

other attached packages:
[1] fuzzyjoin_0.1.6
[2] org.Mm.eg.db_3.15.0
[3] AnnotationDbi_1.58.0
[4] RColorBrewer_1.1-3
[5] pheatmap_1.0.12
[6] colorspace_2.0-3
[7] EnhancedVolcano_1.14.0
[8] ggrepel_0.9.1
[9] forcats_0.5.1
[10] stringr_1.4.0
[11] purrr_0.3.4
[12] readr_2.1.2
[13] tidyr_1.2.0
[14] ggplot2_3.3.5
[15] tidyverse_1.3.1
[16] DESeq2_1.36.0
[17] SummarizedExperiment_1.26.1
[18] Biobase_2.56.0
[19] MatrixGenerics_1.8.0
[20] matrixStats_0.62.0
[21] GenomicRanges_1.48.0
[22] GenomeInfoDb_1.32.1
[23] IRanges_2.30.0
[24] S4Vectors_0.34.0
[25] BiocGenerics_0.42.0
[26] tibble_3.1.6
[27] dplyr_1.0.8
[28] R.utils_2.11.0
[29] R.oo_1.24.0
[30] R.methodsS3_1.8.1
[31] biomaRt_2.52.0
[32] BiocManager_1.30.18

loaded via a namespace (and not attached):
[1] fs_1.5.2 bitops_1.0-7
[3] lubridate_1.8.0 bit64_4.0.5
[5] filelock_1.0.2 progress_1.2.2
[7] httr_1.4.3 backports_1.4.1
[9] utf8_1.2.2 R6_2.5.1
[11] DBI_1.1.2 withr_2.5.0
[13] tidyselect_1.1.2 prettyunits_1.1.1
[15] bit_4.0.4 curl_4.3.2
[17] compiler_4.2.0 rvest_1.0.2
[19] cli_3.3.0 xml2_1.3.3
[21] DelayedArray_0.22.0 scales_1.2.0
[23] genefilter_1.78.0 rappdirs_0.3.3
[25] digest_0.6.29 XVector_0.36.0
[27] pkgconfig_2.0.3 dbplyr_2.2.0
[29] fastmap_1.1.0 readxl_1.4.0
[31] rlang_1.0.2 rstudioapi_0.13
[33] RSQLite_2.2.14 generics_0.1.2
[35] jsonlite_1.8.0 BiocParallel_1.30.0
[37] RCurl_1.98-1.6 magrittr_2.0.3
[39] GenomeInfoDbData_1.2.8 Matrix_1.4-1
[41] Rcpp_1.0.8.3 munsell_0.5.0
[43] fansi_1.0.3 lifecycle_1.0.1
[45] stringi_1.7.6 zlibbioc_1.42.0
[47] BiocFileCache_2.4.0 grid_4.2.0
[49] blob_1.2.3 parallel_4.2.0
[51] crayon_1.5.1 lattice_0.20-45
[53] Biostrings_2.64.0 haven_2.5.0
[55] splines_4.2.0 annotate_1.74.0
[57] hms_1.1.1 KEGGREST_1.36.0
[59] locfit_1.5-9.5 pillar_1.7.0
[61] geneplotter_1.74.0 reprex_2.0.1
[63] XML_3.99-0.9 glue_1.6.2
[65] modelr_0.1.8 png_0.1-7
[67] vctrs_0.4.1 tzdb_0.3.0
[69] cellranger_1.1.0 gtable_0.3.0
[71] assertthat_0.2.1 cachem_1.0.6
[73] xtable_1.8-4 broom_0.8.0
[75] survival_3.3-1 memoise_2.0.1
[77] ellipsis_0.3.2
```

biomaRt DESeq2 • 3.1k views
@charlesfoster-17652

Assuming the rownames are ensembl gene IDs, try:

all_coding_genes <- getBM(attributes=c("mgi_symbol", "ensembl_gene_id"), 
                          values= row.names(res_ordered), 
                          filters = "ensembl_gene_id", 
                          mart = mart, 
                          uniqueRows = TRUE)
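
Requesting `ensembl_gene_id` alongside `mgi_symbol` also matters for the merge step mentioned in the question: without an ID column in the `getBM` output there is nothing to join on. A minimal sketch in base R, using small made-up data frames (`counts_df` and `annot` are hypothetical stand-ins for the counts table and the `getBM` result):

```r
# Hypothetical stand-in for the results/counts table, keyed by Ensembl gene ID
counts_df <- data.frame(
  ensembl_gene_id = c("ENSMUSG00000000001", "ENSMUSG00000000003", "ENSMUSG00000000031"),
  baseMean = c(120.5, 3.2, 48.0)
)

# Hypothetical stand-in for the getBM() output (ID column plus symbol column)
annot <- data.frame(
  ensembl_gene_id = c("ENSMUSG00000000001", "ENSMUSG00000000031"),
  mgi_symbol = c("Gnai3", "H19")
)

# Keep one symbol per ID so the merge cannot duplicate rows
annot <- annot[!duplicated(annot$ensembl_gene_id), ]

# Left join: every row of counts_df is kept; IDs without a symbol get NA
merged <- merge(counts_df, annot, by = "ensembl_gene_id", all.x = TRUE)
print(merged)
```

With a real results table, the row names would first need to be copied into a column (for example `counts_df$ensembl_gene_id <- row.names(res_ordered)`) so that both tables share the join key.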

Hello Charles, thank you for your response. Unfortunately the output is the same: I get a data frame of 0 obs. of 2 variables. It appears that the filter portion of the function is causing the problem.
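
An empty result from `getBM` generally means that none of the supplied values matched the chosen filter, so it can help to look at what the row names actually contain. The sketch below uses a made-up vector `ids` (hypothetical; substitute `row.names(res_ordered)`) and assumes mouse gene IDs of the form `ENSMUSG` plus 11 digits, optionally followed by `.version`. A versioned ID only matches `ensembl_gene_id_version` when the version equals the one in the queried Ensembl release, so stripping the suffix and filtering on `ensembl_gene_id` is often the more robust choice.

```r
# Hypothetical example values; in practice use row.names(res_ordered)
ids <- c("ENSMUSG00000000001.5", " ENSMUSG00000000003", "ENSMUSG00000000031", "Gnai3")

# Remove stray whitespace that would prevent an exact match
ids_clean <- trimws(ids)

# Classify each value by format
is_versioned   <- grepl("^ENSMUSG[0-9]{11}\\.[0-9]+$", ids_clean)
is_unversioned <- grepl("^ENSMUSG[0-9]{11}$", ids_clean)
print(table(versioned = is_versioned, unversioned = is_unversioned))

# Strip the version suffix so the values suit the "ensembl_gene_id" filter
ids_stripped <- sub("\\.[0-9]+$", "", ids_clean)

# Values that still do not look like Ensembl mouse gene IDs
not_ensembl <- ids_stripped[!grepl("^ENSMUSG[0-9]{11}$", ids_stripped)]
print(ids_stripped)
print(not_ensembl)
```

If `not_ensembl` is long, the row names are probably not Ensembl gene IDs at all (for example symbols or transcript IDs), and a different filter would be needed.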


You may also want to consider using an EnsDb annotation package.

To extract some annotation info (AFAIK the column SYMBOL is the MGI symbol):

> library(EnsDb.Mmusculus.v106)
> 
> ensids <- keys(EnsDb.Mmusculus.v106)[1:15]
> 
> annotation.info <- AnnotationDbi::select(EnsDb.Mmusculus.v106, keys = ensids, keytype = "GENEID",
+     columns = c("GENEID", "GENENAME", "DESCRIPTION", "SYMBOL", "GENEBIOTYPE", "ENTREZID",
+                 "SEQNAME", "SEQSTRAND", "PROTEINID", "UNIPROTID",  "UNIPROTDB") )
> 
> # Get rid of duplicates; only keep 1st hit
> annotation.info <- annotation.info[!duplicated(annotation.info[,1]),] 
> 
> #check
> head(annotation.info)
               GENEID GENENAME
1  ENSMUSG00000000001    Gnai3
2  ENSMUSG00000000003     Pbsn
5  ENSMUSG00000000028    Cdc45
10 ENSMUSG00000000031      H19
11 ENSMUSG00000000037    Scml2
20 ENSMUSG00000000049     Apoh
                                                                                            DESCRIPTION
1  guanine nucleotide binding protein (G protein), alpha inhibiting 3 [Source:MGI Symbol;Acc:MGI:95773]
2                                                          probasin [Source:MGI Symbol;Acc:MGI:1860484]
5                                            cell division cycle 45 [Source:MGI Symbol;Acc:MGI:1338073]
10                     H19, imprinted maternally expressed transcript [Source:MGI Symbol;Acc:MGI:95891]
11                                Scm polycomb group protein like 2 [Source:MGI Symbol;Acc:MGI:1340042]
20                                                   apolipoprotein H [Source:MGI Symbol;Acc:MGI:88058]
   SYMBOL    GENEBIOTYPE ENTREZID SEQNAME SEQSTRAND          PROTEINID
1   Gnai3 protein_coding    14679       3        -1 ENSMUSP00000000001
2    Pbsn protein_coding    54192       X        -1 ENSMUSP00000000003
5   Cdc45 protein_coding    12544      16        -1 ENSMUSP00000000028
10    H19         lncRNA    14955       7        -1               <NA>
11  Scml2 protein_coding   107815       X         1 ENSMUSP00000158772
20   Apoh protein_coding    11818      11         1 ENSMUSP00000000049
      UNIPROTID UNIPROTDB
1    Q9DC51.172 SWISSPROT
2    Q3UV89.121  SPTREMBL
5    Q3UI99.108  SPTREMBL
10         <NA>      <NA>
11 A0A571BDL9.8  SPTREMBL
20   Q01339.166 SWISSPROT
> 

I downloaded the EnsDb (EnsDb.Mmusculus.v106) from the AnnotationHub.

For more on this, see e.g. the post "ensembldb EnsDb databases for Ensembl release 101 added to AnnotationHub", and for some code to get you started, see "EnsDb.Rnorvegicus for Rnor6". You will obviously have to adapt the code to your use case (i.e. for Mus musculus and v106).
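
As a rough sketch of the AnnotationHub route (not the exact code from the posts mentioned above): the helper name `fetch_mouse_ensdb` is hypothetical, and the code assumes the AnnotationHub and ensembldb packages are installed and that the hub holds an EnsDb for Mus musculus at the requested release. It needs network access on first use, and the query may return more than one record, so it is worth inspecting the hits before trusting the first one.

```r
# Hypothetical helper: look up a mouse EnsDb in AnnotationHub and return it.
# Assumes AnnotationHub and ensembldb are installed.
fetch_mouse_ensdb <- function(release = "106") {
  ah <- AnnotationHub::AnnotationHub()
  # Every pattern must match the record metadata
  hits <- AnnotationHub::query(ah, c("EnsDb", "Mus musculus", release))
  if (length(hits) == 0) {
    stop("No EnsDb record found for release ", release)
  }
  if (length(hits) > 1) {
    message("Several records matched; using the first. Inspect the query result to confirm.")
  }
  # Download the record, or load it from the local cache
  ah[[names(hits)[1]]]
}

# Usage (not run here; requires the packages above and network access):
# edb <- fetch_mouse_ensdb("106")
# AnnotationDbi::select(edb, keys = "ENSMUSG00000000001",
#                       keytype = "GENEID", columns = c("GENEID", "SYMBOL"))
```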
