Question

Is mapIds supposed to concatenate multiVals?

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 14 months ago

United States

I used the AnnotationDbi::mapIds function with the EnsDb.Mmusculus.v79 package to map a long vector of Ensembl gene IDs to Entrez IDs.

I expected this to return a 1:1 mapping when mapIds(..., multiVals='first'), but was surprised that it returned several Entrez IDs concatenated with ";" for a given Ensembl gene ID, for instance:

R> mapIds(EnsDb.Mmusculus.v79, 'ENSMUSG00000079658', 'ENTREZID', 'GENEID')
ENSMUSG00000079658 
 "67923;102642819"

I've long been working under the assumption that the multiVals parameter is meant to control this, and that it should return only a single identifier when the appropriate value for that parameter is passed (like 'first', 'last', or even 'asNA').

I must say, this finding has shaken the bedrock of all things I thought to be true and I'm having a deja vu moment back to 1999 where I'm asking myself again if I actually might be living inside of The Matrix.

Can I get an assist? Thanks :-)

I'm running on the latest bioc, but just to orient ourselves a bit, here are the versions of the relevant packages:

EnsDb.Mmusculus.v79_2.1.0
ensembldb_2.0.1
AnnotationDbi_1.38.0

annotationdbi ensembldb • 2.5k views

ADD COMMENT • link updated 6.9 years ago by Johannes Rainer ★ 2.0k • written 6.9 years ago by Steve Lianoglou ★ 13k

Johannes Rainer ★ 2.0k

@johannes-rainer-6987

Last seen 21 days ago

Italy

Hi Steve,

I've fixed the issue with the concatenated Entrezgene IDs. What you need is an ensembldb version >= 2.0.2 (in Bioc 3.5, or the newest Bioc devel version) and an updated EnsDb database/package. It might take some days until the updated EnsDb packages are online (most likely only available in BioC 3.6-devel though), but you can already fetch the new EnsDbs from AnnotationHub (for Ensembl releases 87 and 88):

> library(AnnotationHub)
> ah <- AnnotationHub()
snapshotDate(): 2017-04-25
## Get the (updated) EnsDb for Mus Musculus and Ensembl version 87
> edb <- query(ah, c("EnsDb", "Mus Musculus", "87"))[[1]]
require("ensembldb")
downloading from 'https://annotationhub.bioconductor.org/fetch/59960'
retrieving 1 resource
  |======================================================================| 100%
## Check DBSCHEMAVERSION, it should be 2.0 for it to work
> edb
EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.3.0
|Creation time: Sun May 21 00:52:00 2017
|ensembl_version: 87
|ensembl_host: localhost
|Organism: mus_musculus
|taxonomy_id: 10090
|genome_build: GRCm38
|DBSCHEMAVERSION: 2.0
| No. of genes: 50143.
| No. of transcripts: 124168.
|Protein data available.
> mapIds(edb, 'ENSMUSG00000079658', 'ENTREZID', 'GENEID')
ENSMUSG00000079658
             67923

Interestingly, in Ensembl 87 this gene is annotated to only a single Entrezgene ID.

> mapIds(edb, 'ENSMUSG00000079658', 'ENTREZID', 'GENEID', multiVals = "list")
$ENSMUSG00000079658
[1] 67923

But it works for others with multimappings:

> mapIds(edb, 'ENSMUSG00000091318', 'ENTREZID', 'GENEID', multiVals = "list")
$ENSMUSG00000091318
[1] 408191 408192
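
If you want to see which genes have multiple mappings, the list output can be flattened into a long table. This is only a minimal sketch in base R: `mapped` is a hand-built stand-in for what mapIds(..., multiVals = "list") returns (values copied from the outputs above), and the names `mapped`, `long` and `multi` are just for illustration.

```r
# Stand-in for the result of mapIds(..., multiVals = "list");
# the values are copied from the outputs shown in this thread.
mapped <- list(
  ENSMUSG00000079658 = "67923",
  ENSMUSG00000091318 = c("408191", "408192")
)

# One row per (Ensembl gene ID, Entrez ID) pair.
long <- data.frame(
  GENEID   = rep(names(mapped), lengths(mapped)),
  ENTREZID = unlist(mapped, use.names = FALSE),
  stringsAsFactors = FALSE
)

# Ensembl gene IDs that map to more than one Entrez ID.
multi <- names(mapped)[lengths(mapped) > 1]

print(long)
print(multi)
```

Keeping the full table around makes it easier to decide per gene how to resolve a multi-mapping, rather than silently taking the first ID.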

Hope this simplifies your workflow.

cheers, jo

ADD COMMENT • link 6.9 years ago Johannes Rainer ★ 2.0k

Thank you for taking the time to do that, Johannes!

ADD REPLY • link 6.9 years ago Steve Lianoglou ★ 13k

score 2 · Accepted Answer · 2017-05-15

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 8 hours ago

United States

It's in the database that way:

> library(RSQLite)
> con <- dbConnect(SQLite(), paste0(path.package("EnsDb.Mmusculus.v79"), "/extdata/EnsDb.Mmusculus.v79.sqlite"))
> dbGetQuery(con, "select * from gene where gene_id='ENSMUSG00000079658';")
             gene_id gene_name        entrezid   gene_biotype gene_seq_start
1 ENSMUSG00000079658     Tceb1 67923;102642819 protein_coding       16641725
  gene_seq_end seq_name seq_strand seq_coord_system
1     16657042        1         -1       chromosome

Which I would imagine reflects a disagreement between EBI and NCBI as to what is and is not Tceb1.

ADD COMMENT • link 6.9 years ago James W. MacDonald 65k

That's right. Unfortunately I'm currently concatenating Entrezgene identifiers for the same gene using a ; in the database. That might change in the future, when I find the time to redo the database layout.
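
Until then, the concatenated values can be split by hand after the lookup. What follows is only a rough sketch in base R, assuming the IDs come back as a named character vector with ";" as the separator; `split_ids` is a hypothetical helper (it is not part of AnnotationDbi or ensembldb) and it mimics only three of the multiVals options.

```r
# Hypothetical helper, not part of AnnotationDbi or ensembldb:
# split ";"-concatenated IDs and apply a multiVals-like rule.
split_ids <- function(x, multiVals = c("first", "list", "asNA")) {
  multiVals <- match.arg(multiVals)
  parts <- strsplit(x, ";", fixed = TRUE)
  names(parts) <- names(x)
  switch(multiVals,
    # all IDs per key, as a named list
    list = parts,
    # first ID per key (NA if there is none)
    first = vapply(parts, function(p) p[1], character(1)),
    # NA unless the key maps to exactly one ID
    asNA = vapply(
      parts,
      function(p) if (length(p) == 1) p else NA_character_,
      character(1)
    )
  )
}

# Value copied from the question above.
ids <- c(ENSMUSG00000079658 = "67923;102642819")
print(split_ids(ids, "first"))
print(split_ids(ids, "list"))
print(split_ids(ids, "asNA"))
```

Note that "first" here just means first in the stored string, which is not necessarily the biologically preferred ID.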

ADD REPLY • link 6.9 years ago Johannes Rainer ★ 2.0k

I usually try to stick with whatever group's ID I have in hand, rather than trying to cross-match, because these conflicts are inevitable. So if I have Ensembl IDs, I use the EnsDb packages or biomaRt for annotation. If I have Entrez Gene IDs, then I use the TxDb and org packages for annotation.

This particular gene is a perfect example. Only 67923 is on Chr1. The other Gene ID is just a LOC (LOC102642819), and isn't even part of the latest annotation release (it's part of 105), and according to NCBI is on Chr2. Plus NCBI doesn't even agree on the HUGO symbol:

> select(org.Mm.eg.db, c("67923","102642819"), "SYMBOL")
'select()' returned 1:1 mapping between keys and columns
   ENTREZID       SYMBOL
1     67923         Eloc
2 102642819 LOC102642819

Instead, it claims that Tceb1 is an alias:

> select(org.Mm.eg.db, c("67923","102642819"), "ALIAS")
'select()' returned 1:many mapping between keys and columns
   ENTREZID         ALIAS
1     67923 2610043E24Rik
2     67923 2610301I15Rik
3     67923      AA407206
4     67923      AI987979
5     67923      AW049146
6     67923         Tceb1
7     67923          Eloc
8 102642819  LOC102642819

ADD REPLY • link 6.9 years ago James W. MacDonald 65k

Thanks for the quick feedback, James and Johannes.

FWIW, in the meantime I'm going with using biomaRt to map these ... turns out Tceb1 is actually called Eloc now anyways ;-)

ADD REPLY • link 6.9 years ago Steve Lianoglou ★ 13k