Question: Error: same Ensembl ID for different genes
12 months ago by
fengshou.ma10 wrote:
select(, keys = "MIR15A", keytype = "SYMBOL", columns = c("SYMBOL","GENENAME","ENSEMBL"))
'select()' returned 1:1 mapping between keys and columns
  SYMBOL     GENENAME         ENSEMBL
1 MIR15A microRNA 15a ENSG00000231607

select(, keys = "DLEU2", keytype = "SYMBOL", columns = c("SYMBOL","GENENAME","ENSEMBL"))
'select()' returned 1:1 mapping between keys and columns
  SYMBOL                                               GENENAME         ENSEMBL
1  DLEU2 deleted in lymphocytic leukemia 2 (non-protein coding) ENSG00000231607

But the Ensembl ID of MIR15A is ENSG00000283785, not ENSG00000231607.



R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 14393)

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Simplified)_China.936  LC_CTYPE=Chinese (Simplified)_China.936    LC_MONETARY=Chinese (Simplified)_China.936
[4] LC_NUMERIC=C                               LC_TIME=Chinese (Simplified)_China.936    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1]    AnnotationDbi_1.38.2  IRanges_2.10.2        S4Vectors_0.14.3      Biobase_2.36.2        BiocGenerics_0.22.0  
[7] clusterProfiler_3.4.4 DOSE_3.2.0           

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.12        compiler_3.4.0      plyr_1.8.4          tools_3.4.0         digest_0.6.12       bit_1.1-12          RSQLite_2.0        
 [8] memoise_1.1.0       tibble_1.3.3        gtable_0.2.0        pkgconfig_2.0.1     rlang_0.1.2         fastmatch_1.1-0     igraph_1.1.2       
[15] DBI_0.7             rvcheck_0.0.9       fgsea_1.2.1         gridExtra_2.2.1     stringr_1.2.0       bit64_0.9-7         grid_3.4.0         
[22] glue_1.1.1          qvalue_2.8.0        data.table_1.10.4   BiocParallel_1.10.1 GOSemSim_2.2.0      purrr_0.2.3         tidyr_0.7.0        
[29] GO.db_3.4.1         DO.db_2.9           ggplot2_2.2.1       reshape2_1.4.2      blob_1.1.0          magrittr_1.5        splines_3.4.0      
[36] scales_0.4.1        colorspace_1.3-2    stringi_1.1.5       lazyeval_0.2.0      munsell_0.4.3      


12 months ago by
Johannes Rainer1.3k wrote:

If you're working with Ensembl annotations, I would stick to annotation resources that are built on Ensembl-provided data (such as biomaRt or ensembldb). This also avoids potential problems and multi-mappings between Ensembl and NCBI. AFAIK the .eg. packages are built using information from NCBI, and some discrepancies might be explained by the mapping between the databases (NCBI <-> Ensembl).

cheers, jo


The mapping using ensembldb:

> library(AnnotationHub)
> ah <- AnnotationHub()
snapshotDate(): 2017-04-25
> query(ah, c("EnsDb", "Hsapiens"))
AnnotationHub with 2 records
# snapshotDate(): 2017-04-25
# $dataprovider: Ensembl
# $species: Homo Sapiens
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH53211"]]'
  AH53211 | Ensembl 87 EnsDb for Homo Sapiens
  AH53715 | Ensembl 88 EnsDb for Homo Sapiens
> edb <- ah[["AH53715"]]
loading from cache '/Users/jo//.AnnotationHub/60453'
> select(edb, keys = "MIR15A", keytype = "SYMBOL", columns = c("SYMBOL","DESCRIPTION","GENEID"))
  SYMBOL                                      DESCRIPTION          GENEID
1 MIR15A microRNA 15a [Source:HGNC Symbol;Acc:HGNC:31543] ENSG00000283785
12 months ago by
daniel.vantwisk30 wrote:

The mapping is not incorrect, but it is based on older resources from NCBI. The most recent version of the package was built from NCBI resources obtained on March 29th, 2017. We do not continuously rebuild our annotation resources, so that researchers who use them get reproducible results (the results from an annotation package do not change from day to day). If you are looking for more up-to-date annotation information, you can use a Bioconductor package that queries an online database directly. Below I've included two examples. The first shows the date on which the annotation data for an org package were obtained. The second shows how to get the most up-to-date annotation information using biomaRt, which queries Ensembl's BioMart service rather than NCBI.

Here, the attribute EGSOURCEDATE shows the date the annotation information was obtained from NCBI to build the package.

#> OrgDb object:
#> | Db type: OrgDb
#> | Supporting package: AnnotationDbi
#> | ORGANISM: Homo sapiens
#> | SPECIES: Human
#> | EGSOURCEDATE: 2017-Mar29
#> | EGSOURCENAME: Entrez Gene
#> | TAXID: 9606
#> | GOSOURCENAME: Gene Ontology
#> | GOSOURCEDATE: 2017-Mar29
#> | GOEGSOURCEDATE: 2017-Mar29
#> | GOEGSOURCENAME: Entrez Gene
#> | KEGGSOURCEDATE: 2011-Mar15
#> | GPSOURCENAME: UCSC Genome Bioinformatics (Homo sapiens)
#> | GPSOURCEDATE: 2017-Mar17
#> | ENSOURCEDATE: 2017-Mar29
#> | ENSOURCENAME: Ensembl
#> | UPSOURCENAME: Uniprot
#> | UPSOURCEDATE: Wed Apr  5 02:52:37 2017
#> Please see: help('select') for usage information
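
The *SOURCEDATE fields above use a compact form such as 2017-Mar29, which has to be parsed before build dates can be compared. The following is a minimal base-R sketch, not part of AnnotationDbi: the helper name parse_source_date is hypothetical, and it only handles the year-month-day form (not the longer UPSOURCEDATE timestamp). The month is matched against R's built-in month.abb so the result does not depend on the session locale.

```r
# Hypothetical helper: turn a string like "2017-Mar29" into a Date.
parse_source_date <- function(x) {
  # Capture year, English month abbreviation, and day.
  pattern <- "^([0-9]{4})-([A-Za-z]{3})([0-9]{1,2})$"
  parts <- regmatches(x, regexec(pattern, x))[[1]]
  if (length(parts) != 4) stop("unexpected date format: ", x)

  # month.abb is always English ("Jan", "Feb", ...), unlike "%b" in as.Date().
  month <- match(parts[3], month.abb)
  if (is.na(month)) stop("unknown month abbreviation: ", parts[3])

  as.Date(sprintf("%s-%02d-%02d", parts[2], month, as.integer(parts[4])))
}

eg_date   <- parse_source_date("2017-Mar29")  # EGSOURCEDATE from the output above
kegg_date <- parse_source_date("2011-Mar15")  # KEGGSOURCEDATE from the output above

# Number of days between the two snapshots.
snapshot_gap_days <- as.numeric(eg_date - kegg_date)
```

Comparing the parsed dates makes it easy to see that different sources inside one package can come from very different snapshots.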

Here, we use biomaRt to obtain the most up-to-date annotation information.

library(biomaRt)
ensembl <- useMart("ensembl")
ensembl <- useDataset("hsapiens_gene_ensembl", mart = ensembl)
getBM(attributes = c("hgnc_symbol", "ensembl_gene_id"),
      filters = "hgnc_symbol",
      values = c("MIR15A", "DLEU2"),
      mart = ensembl)
#>   hgnc_symbol ensembl_gene_id
#> 1       DLEU2 ENSG00000231607
#> 2      MIR15A ENSG00000283785

In addition, do note that the OrgDb packages that we supply are based on mappings from Entrez Gene IDs to all other annotation sources. If you are trying to map from NCBI IDs to EBI IDs, you will always run into disagreements between the annotation groups. To wit:

> getBM(c("hgnc_symbol","entrezgene","ensembl_gene_id"), "hgnc_symbol", c("MIR15A","DLEU2"), mart)
  hgnc_symbol entrezgene ensembl_gene_id
1       DLEU2         NA ENSG00000231607
2      MIR15A     406948 ENSG00000283785

So EBI doesn't seem to recognize that there is an Entrez Gene ID for DLEU2. But miR15A and DLEU2 are overlapping genes (miR15A comes from an intron of DLEU2), so you can end up with positional mappings that may not make sense if you look at an individual mapping, but that may be programmatically convenient for a group (say NCBI or EBI) that is trying to do a cross-mapping when there isn't agreement between them.

NCBI may have cleaned this one up, but rest assured there are many others. So as Jo already noted, if you want Ensembl IDs, use EBI-based annotation packages. If you want Entrez Gene IDs, use NCBI-based annotation packages. And probably don't use gene symbols for much if possible.
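
To see where two annotation sources disagree, the two mappings can be placed side by side. This is a minimal base-R sketch that uses only the values reported in this thread; the data frame names orgdb_map and ensembl_map are hypothetical stand-ins for the output of select() on the NCBI-based package and of getBM() or ensembldb.

```r
# Symbol -> Ensembl ID as returned by the NCBI-based OrgDb package in this thread.
orgdb_map <- data.frame(
  SYMBOL  = c("MIR15A", "DLEU2"),
  ENSEMBL = c("ENSG00000231607", "ENSG00000231607"),
  stringsAsFactors = FALSE
)

# Symbol -> Ensembl ID as returned by the Ensembl-based resources in this thread.
ensembl_map <- data.frame(
  SYMBOL  = c("MIR15A", "DLEU2"),
  ENSEMBL = c("ENSG00000283785", "ENSG00000231607"),
  stringsAsFactors = FALSE
)

# Join on the symbol; the suffixes keep the two ID columns apart.
both <- merge(orgdb_map, ensembl_map, by = "SYMBOL",
              suffixes = c(".orgdb", ".ensembl"))

# Flag rows where the two sources report different Ensembl IDs.
both$agree <- both$ENSEMBL.orgdb == both$ENSEMBL.ensembl
disagreements <- both[!both$agree, ]
```

With the values from this thread, only MIR15A ends up in disagreements; DLEU2 maps to the same ID in both sources.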

written 12 months ago by James W. MacDonald47k
12 months ago by
fengshou.ma10 wrote:

It seems that the Ensembl IDs of all miRNAs are wrong.
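
A claim like this can be checked by listing every Ensembl ID that is assigned to more than one symbol. This is a minimal base-R sketch, assuming a select()-style result with SYMBOL and ENSEMBL columns; the data frame res is hypothetical and holds only the two rows shown in the question, so it illustrates the check rather than confirming the claim for all miRNAs.

```r
# Stand-in for a larger select() result; only the two rows from the question.
res <- data.frame(
  SYMBOL  = c("MIR15A", "DLEU2"),
  ENSEMBL = c("ENSG00000231607", "ENSG00000231607"),
  stringsAsFactors = FALSE
)

# Group symbols by Ensembl ID, then keep IDs that carry more than one symbol.
symbols_by_id <- split(res$SYMBOL, res$ENSEMBL)
shared_ids <- symbols_by_id[lengths(symbols_by_id) > 1]
```

Each element of shared_ids is named by an Ensembl ID and lists the symbols that share it, which gives a concrete set of genes to compare against an Ensembl-based resource.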
