Question

ensembldb and pseudogenes mapping to the same Ensembl ID

meeta.mistry ▴ 30

Hello,

I ran into a problem when mapping Ensembl gene IDs to Entrez IDs and was wondering whether there is a way around it. For a list of Ensembl gene IDs, I used the select function to retrieve gene symbols and Entrez IDs.

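library(EnsDb.Mmusculus.v79)

# `common` is a character vector of Ensembl gene IDs.
# The column is named "GENEID" (not "GENE_ID"), matching the keytype.
common_genes <- select(EnsDb.Mmusculus.v79, keys=common, 
        columns=c("ENTREZID", "SYMBOL", "GENEID"), 
        keytype="GENEID")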

Browsing through the table, I noticed duplicate matches (i.e. for a single Ensembl ID there are two Entrez IDs). I searched for these IDs in the Entrez database and found that the additional IDs belong to pseudogenes, which in fact have different gene symbols that the output does not report.

For example:

               GENEID  ENTREZID SYMBOL
72 ENSMUSG00000000740    270106  Rpl13
73 ENSMUSG00000000740 100040416  Rpl13

The second Entrez ID is for Rpl13-ps6, which maps to ENSMUSG00000059776, so this table reports the mapping incorrectly.
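To see how many Ensembl IDs are affected, one option is to count the distinct Entrez IDs per Ensembl ID. The following is a minimal base R sketch, not an ensembldb feature: the two example rows above stand in for the real `common_genes` table, and `flag_multi_mapped` is a hypothetical helper name.

```r
# Stand-in for the select() result; these are the two rows shown above.
common_genes <- data.frame(
    GENEID   = c("ENSMUSG00000000740", "ENSMUSG00000000740"),
    ENTREZID = c(270106L, 100040416L),
    SYMBOL   = c("Rpl13", "Rpl13"),
    stringsAsFactors = FALSE
)

# Hypothetical helper: return the rows whose GENEID maps to more than
# one distinct ENTREZID.
flag_multi_mapped <- function(tbl) {
    pairs <- unique(tbl[, c("GENEID", "ENTREZID")])
    n_entrez <- table(pairs$GENEID)
    multi <- names(n_entrez)[n_entrez > 1]
    tbl[tbl$GENEID %in% multi, , drop = FALSE]
}

ambiguous <- flag_multi_mapped(common_genes)
```

Here `ambiguous` holds only the rows that need a manual or automated check; IDs with a single Entrez match are left out.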

Is there any way of identifying these pseudogenes using information stored in the database? Perhaps, if Entrez gene symbols are stored, we could use those to filter out pseudogenes?

Any help on this would be much appreciated. Thanks in advance.

Meeta

ensembldb • 1.3k views

6.3 years ago meeta.mistry ▴ 30

score 1 · Answer 1 · 2018-01-09

Dear Meeta,

Mapping between Entrez and Ensembl IDs is always problematic. EnsDb databases provide all the information from Ensembl for a specific release, and in version 79 (March 2015) this one gene was annotated to two Entrez identifiers. Unfortunately, EnsDb databases contain no additional information about Entrez genes (such as whether an Entrez gene is a pseudogene). For the mapping you could instead use the org.Mm.eg.db package, which uses annotations from NCBI:

> library(org.Mm.eg.db)
> select(org.Mm.eg.db, columns = c("ENTREZID", "SYMBOL", "ENSEMBL"), keys = "Rpl13", keytype = "SYMBOL")
'select()' returned 1:1 mapping between keys and columns
  SYMBOL ENTREZID            ENSEMBL
1  Rpl13   270106 ENSMUSG00000000740
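If the aim is to keep the EnsDb table but drop the extra Entrez IDs, the NCBI-based mapping could serve as a filter: keep only the (Ensembl ID, Entrez ID) pairs that org.Mm.eg.db also reports. This is a rough base R sketch under that assumption. Small hand-typed data frames stand in for the two `select()` results (rows taken from the outputs in this thread), and `keep_confirmed` is a hypothetical helper name.

```r
# Stand-in for the EnsDb v79 result from the question.
ens_map <- data.frame(
    GENEID   = c("ENSMUSG00000000740", "ENSMUSG00000000740"),
    ENTREZID = c("270106", "100040416"),
    SYMBOL   = c("Rpl13", "Rpl13"),
    stringsAsFactors = FALSE
)

# Stand-in for the org.Mm.eg.db result shown above.
ncbi_map <- data.frame(
    SYMBOL   = "Rpl13",
    ENTREZID = "270106",
    ENSEMBL  = "ENSMUSG00000000740",
    stringsAsFactors = FALSE
)

# Hypothetical helper: keep the EnsDb rows whose (Ensembl ID, Entrez ID)
# pair also occurs in the NCBI-based mapping. IDs are compared as text.
keep_confirmed <- function(ens, ncbi) {
    ens_key  <- paste(ens$GENEID, as.character(ens$ENTREZID))
    ncbi_key <- paste(ncbi$ENSEMBL, as.character(ncbi$ENTREZID))
    ens[ens_key %in% ncbi_key, , drop = FALSE]
}

filtered <- keep_confirmed(ens_map, ncbi_map)
```

With real data, `ncbi_map` would presumably come from something like `select(org.Mm.eg.db, keys = common, keytype = "ENSEMBL", columns = c("ENTREZID", "SYMBOL"))`. Note that this filter also drops Ensembl IDs that org.Mm.eg.db does not know at all, so it may be worth checking how many rows are lost.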

Alternatively, use an EnsDb database for a more recent Ensembl release (it seems this was fixed in the more recent release):

> library(AnnotationHub)
> edb <- query(AnnotationHub(), "EnsDb.Mmusculus.v90")[[1]]
snapshotDate(): 2017-10-27
loading from cache '/Users/jo//.AnnotationHub/64508'
> select(edb, columns = c("ENTREZID", "SYMBOL", "GENEID"), keys = "Rpl13", keytype = "SYMBOL")
  ENTREZID SYMBOL             GENEID
1   270106  Rpl13 ENSMUSG00000000740

Finally, my session info:

> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin17.3.0/x86_64 (64-bit)
Running under: macOS High Sierra 10.13.3

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
 [1] ensembldb_2.2.0        AnnotationFilter_1.2.0 GenomicFeatures_1.30.0
 [4] GenomicRanges_1.30.1   GenomeInfoDb_1.14.0    AnnotationHub_2.10.1  
 [7] org.Mm.eg.db_3.5.0     AnnotationDbi_1.40.0   IRanges_2.12.0        
[10] S4Vectors_0.16.0       Biobase_2.38.0         BiocGenerics_0.24.0   
[13] BiocInstaller_1.28.0  

loaded via a namespace (and not attached):
 [1] SummarizedExperiment_1.8.1    progress_1.1.2               
 [3] lattice_0.20-35               htmltools_0.3.6              
 [5] rtracklayer_1.38.2            yaml_2.1.16                  
 [7] interactiveDisplayBase_1.16.0 blob_1.1.0                   
 [9] XML_3.98-1.9                  rlang_0.1.6                  
[11] pillar_1.0.1                  DBI_0.7                      
[13] BiocParallel_1.12.0           bit64_0.9-7                  
[15] matrixStats_0.52.2            GenomeInfoDbData_1.0.0       
[17] ProtGenerics_1.10.0           stringr_1.2.0                
[19] zlibbioc_1.24.0               Biostrings_2.46.0            
[21] memoise_1.1.0                 biomaRt_2.34.1               
[23] httpuv_1.3.5                  curl_3.1                     
[25] Rcpp_0.12.14                  xtable_1.8-2                 
[27] DelayedArray_0.4.1            XVector_0.18.0               
[29] mime_0.5                      bit_1.1-12                   
[31] Rsamtools_1.30.0              RMySQL_0.10.13               
[33] digest_0.6.13                 stringi_1.1.6                
[35] shiny_1.0.5                   grid_3.4.3                   
[37] tools_3.4.3                   bitops_1.0-6                 
[39] magrittr_1.5                  lazyeval_0.2.1               
[41] RCurl_1.95-4.10               tibble_1.4.1                 
[43] RSQLite_2.0                   pkgconfig_2.0.1              
[45] Matrix_1.2-12                 prettyunits_1.0.2            
[47] assertthat_0.2.0              httr_1.3.1                   
[49] R6_2.2.2                      GenomicAlignments_1.14.1     
[51] compiler_3.4.3

score 0 · Answer 2 · 2018-01-09