Question: Is mapIds supposed to concatenate multiVals?
Steve Lianoglou (Genentech) wrote 6 months ago:

I used the AnnotationDbi::mapIds function with the EnsDb.Mmusculus.v79 package to map a long vector of Ensembl gene IDs back to Entrez IDs.

I expected this to return a 1:1 mapping when mapIds(..., multiVals='first'), but was surprised that this returned several entrez ids concatenated with ";" for a given ensembl gene id, for instance:

R> mapIds(EnsDb.Mmusculus.v79, 'ENSMUSG00000079658', 'ENTREZID', 'GENEID')
ENSMUSG00000079658 
 "67923;102642819" 

I've long been working under the assumption that the multiVals parameter is meant to control this, and that it should only return a single identifier when the appropriate value for that parameter is passed (like 'first', 'last', or even 'asNA').

I must say, this finding has shaken the bedrock of all things I thought to be true and I'm having a deja vu moment back to 1999 where I'm asking myself again if I actually might be living inside of The Matrix.

Can I get an assist? Thanks :-)

I'm running the latest Bioc, but just to orient ourselves a bit, here are the versions of the relevant packages:

EnsDb.Mmusculus.v79_2.1.0
ensembldb_2.0.1
AnnotationDbi_1.38.0

modified 5 months ago by Johannes Rainer • written 6 months ago by Steve Lianoglou
James W. MacDonald (United States) wrote 6 months ago:

It's in the database that way:

> library(DBI)
> library(RSQLite)
> con <- dbConnect(SQLite(), paste0(path.package("EnsDb.Mmusculus.v79"), "/extdata/EnsDb.Mmusculus.v79.sqlite"))
> dbGetQuery(con, "select * from gene where gene_id='ENSMUSG00000079658';")
             gene_id gene_name        entrezid   gene_biotype gene_seq_start
1 ENSMUSG00000079658     Tceb1 67923;102642819 protein_coding       16641725
  gene_seq_end seq_name seq_strand seq_coord_system
1     16657042        1         -1       chromosome

I would imagine this reflects a disagreement between EBI and NCBI as to what is and is not Tceb1.

written 6 months ago by James W. MacDonald

That's right. Unfortunately, I'm currently concatenating Entrezgene identifiers for the same gene with a ";" in the database. That might change in the future, when I find the time to redo the database layout.
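
Until the layout changes, one possible workaround is to split the concatenated string on ";" yourself. Below is a minimal base R sketch, not part of ensembldb or AnnotationDbi. The helper name split_entrez and its multiVals-like argument are made up for illustration, and the input is typed by hand to mimic the mapIds() output from the question.

```r
# Hypothetical helper (not an ensembldb or AnnotationDbi function):
# split ";"-concatenated Entrez IDs and apply a multiVals-like rule.
split_entrez <- function(x, multiVals = c("first", "list", "asNA")) {
  multiVals <- match.arg(multiVals)
  ids <- strsplit(x, ";", fixed = TRUE)  # one character vector per key
  names(ids) <- names(x)
  switch(multiVals,
    list  = ids,
    first = vapply(ids, function(v) v[1], character(1)),
    asNA  = vapply(ids, function(v) if (length(v) == 1) v else NA_character_,
                   character(1))
  )
}

# Hand-typed input mimicking the mapIds() result from the question
res <- c(ENSMUSG00000079658 = "67923;102642819")
first_ids <- split_entrez(res, "first")  # one ID per Ensembl gene
all_ids   <- split_entrez(res, "list")   # all IDs, as a named list
```

Which of the concatenated IDs ends up first depends on how the database was built, so "first" here only guarantees a single value, not necessarily the best match.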

written 6 months ago by Johannes Rainer

I usually try to stick with whatever group's ID I have in hand, rather than trying to cross-match, because these conflicts are inevitable. So if I have Ensembl IDs, I use the EnsDb packages or biomaRt for annotation. If I have Entrez Gene IDs, then I use the TxDb and org packages for annotation.

This particular gene is a perfect example. Only 67923 is on Chr1. The other Gene ID is just a LOC (LOC102642819), and isn't even part of the latest annotation release (it's part of release 105), and according to NCBI it is on Chr2. Plus NCBI doesn't even agree on the HUGO symbol:

> select(org.Mm.eg.db, c("67923","102642819"), "SYMBOL")
'select()' returned 1:1 mapping between keys and columns
   ENTREZID       SYMBOL
1     67923         Eloc
2 102642819 LOC102642819

Instead, it claims that Tceb1 is an alias:

> select(org.Mm.eg.db, c("67923","102642819"), "ALIAS")
'select()' returned 1:many mapping between keys and columns
   ENTREZID         ALIAS
1     67923 2610043E24Rik
2     67923 2610301I15Rik
3     67923      AA407206
4     67923      AI987979
5     67923      AW049146
6     67923         Tceb1
7     67923          Eloc
8 102642819  LOC102642819
written 6 months ago by James W. MacDonald

Thanks for the quick feedback, James and Johannes.

FWIW, in the meantime I'm using biomaRt to map these ... turns out Tceb1 is actually called Eloc now anyway ;-)

written 6 months ago by Steve Lianoglou

Johannes Rainer (Italy) wrote 5 months ago:

Hi Steve,

I've fixed the issue with the concatenated Entrezgene IDs. What you need is ensembldb version >= 2.0.2 (in Bioc 3.5, or the newest Bioc devel version) and an updated EnsDb database/package. It might take a few days until the updated EnsDb packages are online (most likely only in Bioc 3.6-devel, though), but you can already fetch the new EnsDbs from AnnotationHub (for Ensembl releases 87 and 88):

> library(AnnotationHub)
> ah <- AnnotationHub()
snapshotDate(): 2017-04-25
## Get the (updated) EnsDb for Mus Musculus and Ensembl version 87
> edb <- query(ah, c("EnsDb", "Mus Musculus", "87"))[[1]]
require(“ensembldb”)
downloading from 'https://annotationhub.bioconductor.org/fetch/59960'
retrieving 1 resource
  |======================================================================| 100%
## Check DBSCHEMAVERSION, it should be 2.0 for it to work
> edb
EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.3.0
|Creation time: Sun May 21 00:52:00 2017
|ensembl_version: 87
|ensembl_host: localhost
|Organism: mus_musculus
|taxonomy_id: 10090
|genome_build: GRCm38
|DBSCHEMAVERSION: 2.0
| No. of genes: 50143.
| No. of transcripts: 124168.
|Protein data available.
> mapIds(edb, 'ENSMUSG00000079658', 'ENTREZID', 'GENEID')
ENSMUSG00000079658
             67923

Interestingly, in Ensembl 87 this gene is annotated to only a single Entrezgene ID.

> mapIds(edb, 'ENSMUSG00000079658', 'ENTREZID', 'GENEID', multiVals = "list")
$ENSMUSG00000079658
[1] 67923


But it works for others with multimappings:

> mapIds(edb, 'ENSMUSG00000091318', 'ENTREZID', 'GENEID', multiVals = "list")
$ENSMUSG00000091318
[1] 408191 408192
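
For downstream code, the list returned with multiVals = "list" can be collapsed in base R however you like. The sketch below uses a hand-built list shaped like the two outputs above, so it runs without the EnsDb. Storing the IDs as integers is an assumption based on how they print here, and the variable names are only illustrative.

```r
# Hand-built stand-in for the mapIds(..., multiVals = "list") output above
mapped <- list(
  ENSMUSG00000079658 = 67923L,
  ENSMUSG00000091318 = c(408191L, 408192L)
)

# Number of Entrez IDs per Ensembl gene
n_ids <- lengths(mapped)

# Keep only the first ID per gene, similar in spirit to multiVals = "first"
first_id <- vapply(mapped, function(v) v[1], integer(1))

# Ensembl gene IDs that map to more than one Entrez ID
multi <- names(mapped)[n_ids > 1]
```

Listing the multi-mapped genes like this makes it easy to inspect them by hand before deciding which Entrez ID to keep.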

 

Hope this simplifies your workflow.

cheers, jo

written 5 months ago by Johannes Rainer

Thank you for taking the time to do that, Johannes!

written 5 months ago by Steve Lianoglou