Question: ensembldb and pseudogenes mapping to the same Ensembl ID
22 months ago, meeta.mistry (United States) wrote:


I encountered a problem when mapping Ensembl genes to Entrez IDs and was wondering if there is a way around it. For a list of Ensembl gene IDs, I used the select function to retrieve gene symbols and Entrez IDs.

library(EnsDb.Mmusculus.v79)
common_genes <- select(EnsDb.Mmusculus.v79, keys=common,
        columns=c("ENTREZID", "SYMBOL", "GENEID"),
        keytype="GENEID")

Browsing through the table, I noticed duplicate matches (i.e. for a single Ensembl ID there are two Entrez IDs). I searched for these IDs in the Entrez database and found that they are pseudogenes and in fact have different gene symbols, but they are not reported that way in the output.

For example:

72 ENSMUSG00000000740    270106  Rpl13
73 ENSMUSG00000000740 100040416  Rpl13

The second Entrez ID belongs to Rpl13-ps6, which maps to ENSMUSG00000059776, so the table reports this mapping incorrectly.

Is there any way of identifying these pseudogenes using information stored in the database? Perhaps, if Entrez gene symbols are stored, we could use those to filter out pseudogenes?

Any help on this would be much appreciated. Thanks in advance.

Answer: ensembldb and pseudogenes mapping to the same Ensembl ID
22 months ago, Johannes Rainer wrote:

Dear Meeta,

Mapping between Entrez and Ensembl IDs is always problematic. EnsDb databases provide all the information from Ensembl for a specific release, and in version 79 (March 2015) this one gene was annotated to two Entrez identifiers. Unfortunately, EnsDb databases contain no additional information about Entrez genes (such as whether an Entrez gene is a pseudogene). For the mapping you could also use the org.Mm.eg.db package instead (which uses annotations from NCBI):

> library(org.Mm.eg.db)
> select(org.Mm.eg.db, columns = c("ENTREZID", "SYMBOL", "ENSEMBL"), keys = "Rpl13", keytype = "SYMBOL")
'select()' returned 1:1 mapping between keys and columns
  SYMBOL ENTREZID            ENSEMBL
1  Rpl13   270106 ENSMUSG00000000740
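
One possible way to combine both sources is to keep only those Ensembl/Entrez pairs from the EnsDb table that the NCBI-based mapping confirms. The following is a minimal base R sketch under stated assumptions, not a definitive implementation: both tables are built by hand from the IDs mentioned in this thread, and the object names (ensdb_map, ncbi_map, confirmed) are hypothetical. In practice the NCBI-side table would presumably come from a select() call on the NCBI-based annotation package with keytype = "ENTREZID" and columns = "ENSEMBL".

```r
# Toy copy of the EnsDb v79 result from the question (column names assumed).
ensdb_map <- data.frame(
  GENEID   = c("ENSMUSG00000000740", "ENSMUSG00000000740"),
  ENTREZID = c("270106", "100040416"),
  SYMBOL   = c("Rpl13", "Rpl13"),
  stringsAsFactors = FALSE
)

# Toy NCBI-side mapping, hand-built from the IDs quoted in this thread:
# 270106 -> ENSMUSG00000000740, 100040416 (Rpl13-ps6) -> ENSMUSG00000059776.
ncbi_map <- data.frame(
  ENTREZID = c("270106", "100040416"),
  ENSEMBL  = c("ENSMUSG00000000740", "ENSMUSG00000059776"),
  stringsAsFactors = FALSE
)

# Inner join on the (Ensembl ID, Entrez ID) pair: only pairs present in
# both tables survive, which drops the pseudogene row here.
confirmed <- merge(ensdb_map, ncbi_map,
                   by.x = c("GENEID", "ENTREZID"),
                   by.y = c("ENSEMBL", "ENTREZID"))
print(confirmed)
```

Pairs that disagree between the two resources are dropped by this join, so it is worth inspecting the discarded rows before relying on the filtered table.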


Alternatively, use an EnsDb database for a more recent Ensembl release (the mapping seems to have been fixed there):

> library(AnnotationHub)
> edb <- query(AnnotationHub(), "EnsDb.Mmusculus.v90")[[1]]
snapshotDate(): 2017-10-27
loading from cache '/Users/jo//.AnnotationHub/64508'
> select(edb, columns = c("ENTREZID", "SYMBOL", "GENEID"), keys = "Rpl13", keytype = "SYMBOL")
  ENTREZID SYMBOL             GENEID
1   270106  Rpl13 ENSMUSG00000000740
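
Whichever release is used, it may be worth checking a select() result for remaining one-to-many mappings rather than browsing the table by eye. The sketch below uses base R only and a hand-built toy table: the first two rows are the ones quoted in the question, the third row is a made-up placeholder added for contrast, and the names tab, n_entrez, multi_mapped and ambiguous are hypothetical.

```r
# Toy stand-in for a select() result; the last row is a made-up placeholder
# (not a real gene) representing an unambiguous 1:1 mapping.
tab <- data.frame(
  GENEID   = c("ENSMUSG00000000740", "ENSMUSG00000000740", "ENSMUSG_PLACEHOLDER"),
  ENTREZID = c("270106", "100040416", "1"),
  SYMBOL   = c("Rpl13", "Rpl13", "Placeholder"),
  stringsAsFactors = FALSE
)

# Count the distinct Entrez IDs reported for each Ensembl gene ID.
n_entrez <- tapply(tab$ENTREZID, tab$GENEID, function(x) length(unique(x)))

# Ensembl IDs that map to more than one Entrez ID.
multi_mapped <- names(n_entrez)[n_entrez > 1]

# Rows that need manual review or a second annotation source.
ambiguous <- tab[tab$GENEID %in% multi_mapped, ]
print(ambiguous)
```

The flagged rows can then be checked against a second resource instead of being silently kept or dropped.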


Finally, my session info:

> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin17.3.0/x86_64 (64-bit)
Running under: macOS High Sierra 10.13.3

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
 [1] ensembldb_2.2.0        AnnotationFilter_1.2.0 GenomicFeatures_1.30.0
 [4] GenomicRanges_1.30.1   GenomeInfoDb_1.14.0    AnnotationHub_2.10.1  
 [7]     AnnotationDbi_1.40.0   IRanges_2.12.0        
[10] S4Vectors_0.16.0       Biobase_2.38.0         BiocGenerics_0.24.0   
[13] BiocInstaller_1.28.0  

loaded via a namespace (and not attached):
 [1] SummarizedExperiment_1.8.1    progress_1.1.2               
 [3] lattice_0.20-35               htmltools_0.3.6              
 [5] rtracklayer_1.38.2            yaml_2.1.16                  
 [7] interactiveDisplayBase_1.16.0 blob_1.1.0                   
 [9] XML_3.98-1.9                  rlang_0.1.6                  
[11] pillar_1.0.1                  DBI_0.7                      
[13] BiocParallel_1.12.0           bit64_0.9-7                  
[15] matrixStats_0.52.2            GenomeInfoDbData_1.0.0       
[17] ProtGenerics_1.10.0           stringr_1.2.0                
[19] zlibbioc_1.24.0               Biostrings_2.46.0            
[21] memoise_1.1.0                 biomaRt_2.34.1               
[23] httpuv_1.3.5                  curl_3.1                     
[25] Rcpp_0.12.14                  xtable_1.8-2                 
[27] DelayedArray_0.4.1            XVector_0.18.0               
[29] mime_0.5                      bit_1.1-12                   
[31] Rsamtools_1.30.0              RMySQL_0.10.13               
[33] digest_0.6.13                 stringi_1.1.6                
[35] shiny_1.0.5                   grid_3.4.3                   
[37] tools_3.4.3                   bitops_1.0-6                 
[39] magrittr_1.5                  lazyeval_0.2.1               
[41] RCurl_1.95-4.10               tibble_1.4.1                 
[43] RSQLite_2.0                   pkgconfig_2.0.1              
[45] Matrix_1.2-12                 prettyunits_1.0.2            
[47] assertthat_0.2.0              httr_1.3.1                   
[49] R6_2.2.2                      GenomicAlignments_1.14.1     
[51] compiler_3.4.3               

Answer: ensembldb and pseudogenes mapping to the same Ensembl ID
22 months ago, meeta.mistry (United States) wrote:

Hi Johannes,

Thank you for your quick reply! Both of those alternatives are good to know and very helpful since I use this package often for cross-database annotations.
