Mapping EntrezId to Ensembl IDs returns NA for pseudogenes and snoRNA
@58262852

Hello everybody, I am pretty new to the bioinformatics world and I would really appreciate any advice regarding this issue (and how to properly look for help). I have a list of genes from an RNA-Seq experiment as Entrez IDs that I need to convert to Ensembl IDs. I am using AnnotationDbi with both EnsDb.Hsapiens.v86 and org.Hs.eg.db, but I get NA values for pseudogenes and snoRNAs whenever I run the code below. Is there a better way of doing this? From what I found online this seems to be a frequent issue, but it should be solvable, as I checked several of the unmapped genes and they do have Ensembl IDs assigned to them. Thanks in advance!


EnsDb2 <- AnnotationDbi::mapIds(EnsDb.Hsapiens.v86,
                                keys = Data$gene_id,
                                column = "GENEID",
                                keytype = "ENTREZID",
                                multiVals = "first")

orgDb_mapID <- AnnotationDbi::mapIds(org.Hs.eg.db,
                                     keys = Data$gene_id,
                                     column = "ENSEMBL",
                                     keytype = "ENTREZID")

sessionInfo()

R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] EnsDb.Hsapiens.v86_2.99.0 ensembldb_2.14.0          AnnotationFilter_1.14.0   GenomicFeatures_1.42.3
[5] GenomicRanges_1.42.0      GenomeInfoDb_1.26.7       xlsx_0.6.5                org.Hs.eg.db_3.12.0
[9] AnnotationDbi_1.52.0      IRanges_2.24.1            S4Vectors_0.28.1          Biobase_2.50.0
[13] BiocGenerics_0.36.1

loaded via a namespace (and not attached):
[1] Rcpp_1.0.6                  lattice_0.20-41             prettyunits_1.1.1
[4] Rsamtools_2.6.0             xlsxjars_0.6.1              Biostrings_2.58.0
[7] assertthat_0.2.1            utf8_1.2.1                  BiocFileCache_1.14.0
[10] R6_2.5.0                    RSQLite_2.2.6               httr_1.4.2
[13] pillar_1.6.0                zlibbioc_1.36.0             rlang_0.4.10
[16] progress_1.2.2              lazyeval_0.2.2              curl_4.3
[19] rstudioapi_0.13             blob_1.2.1                  Matrix_1.3-2
[22] BiocParallel_1.24.1         stringr_1.4.0               ProtGenerics_1.22.0
[25] RCurl_1.98-1.3              bit_4.0.4                   biomaRt_2.46.3
[28] DelayedArray_0.16.3         compiler_4.0.3              rtracklayer_1.50.0
[34] tidyselect_1.1.0            SummarizedExperiment_1.20.0 tibble_3.1.1
[37] GenomeInfoDbData_1.2.4      matrixStats_0.58.0          XML_3.99-0.6
[40] fansi_0.4.2                 crayon_1.4.1                dplyr_1.0.5
[43] dbplyr_2.1.1                GenomicAlignments_1.26.0    bitops_1.0-6
[46] rappdirs_0.3.3              grid_4.0.3                  lifecycle_1.0.0
[49] DBI_1.1.1                   magrittr_2.0.1              stringi_1.5.3
[52] cachem_1.0.4                XVector_0.30.0              xml2_1.3.2
[55] ellipsis_0.3.1              generics_0.1.0              vctrs_0.3.7
[58] tools_4.0.3                 bit64_4.0.5                 glue_1.4.2
[61] purrr_0.3.4                 hms_1.0.0                   MatrixGenerics_1.2.1
[64] fastmap_1.1.0               BiocManager_1.30.12         memoise_2.0.0
[67] rJava_0.9-13

@james-w-macdonald-5106

You said

By looking online it seems that it is a frequent issue, but it should be able to be solved as I checked several of the unmapped genes and they have Ensembl IDs assigned to them.

That is orthogonal to the issue you are having. What you are trying to do is find the corresponding Ensembl ID for an NCBI Gene ID, which means that Ensembl has used some criteria to determine whether a particular gene identified by NCBI has an exact, or close to exact, counterpart in the set of Ensembl genes. An NCBI gene can have an Ensembl ID listed on its own page and still lack that counterpart in the annotation packages you are querying.

I don't know exactly what criteria Ensembl uses to say whether one of their genes matches an NCBI gene, nor do I know the criteria NCBI uses to match with Ensembl. You could probably look it up if you were interested. I assume it's pretty complex yet surprisingly boring, so I haven't done so myself.

But given that it's a pretty well-known issue (and that the MANE project has been going for like three years now and still isn't done), it must be pretty complex. I have a long-standing recommendation around these parts that people shouldn't try to map between the two annotation services, because it's complex and almost surely not relevant to the task at hand. Instead, it's better to start with the annotation service you like and stay with it for the entirety of the analysis, so you don't have to contend with this issue at all.

So if you need Ensembl IDs, map your reads to the Ensembl or Gencode genome of your choice. If you need NCBI IDs, use NCBI genes.
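If you want to see how much is being lost before deciding whether to redo the analysis against a single annotation source, a rough tally of the unmapped keys can help. The sketch below is only an illustration in base R: `mapped` is a toy stand-in for the named character vector that `mapIds()` returns, `summarise_mapping()` is a hypothetical helper name, and the key "0000000" is made up purely to produce an NA.

```r
# Toy stand-in for a mapIds() result: names are the Entrez keys that were
# queried, values are the Ensembl gene IDs (or NA when nothing was found).
# "0000000" is a made-up key, included only so that one entry is NA.
mapped <- c("7157"    = "ENSG00000141510",
            "672"     = "ENSG00000012048",
            "0000000" = NA_character_)

# Hypothetical helper: count the unmapped keys and list them.
summarise_mapping <- function(ids) {
  unmapped <- names(ids)[is.na(ids)]
  list(n_total           = length(ids),
       n_unmapped        = length(unmapped),
       fraction_unmapped = length(unmapped) / length(ids),
       unmapped_keys     = unmapped)
}

res <- summarise_mapping(mapped)
print(res$unmapped_keys)
```

With real data you would pass the vector returned by `mapIds()` (for example `orgDb_mapID` from the question) instead of the toy `mapped` vector. The listed keys are the ones to check by hand, or the ones that would drop out of the analysis if you kept mapping between the two services.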

@58262852

Hey James, I really appreciate you taking the time to answer my question. As you mentioned, it looks like "marrying" one annotation service and sticking with it is the way to go, at least for the time being.