Question: Annotating DEseq output using AnnotationDbi mapIds and most results say NA
gravatar for Mike
8 days ago by
Mike0 wrote:

I have a list of ~31,000 mouse transcripts with their Ensembl transcript IDs that I'm trying to annotate using AnnotationDbi and the database. R v3.3.2, the object containing the IDs is called "temp". My command is:

mapIds(, keys=row.names(temp), keytype="ENSEMBLTRANS", column="SYMBOL", multiVals="first")

Only ~8500 get annotated with a gene name/symbol while the rest get "NA". If I search some of the NAs on Ensembl they match to transcripts/genes correctly. Some examples:

Transcript ID from "temp" Result from mapIDs Link to Ensembl record
ENSMUST00000000001 Gnai3 Ensembl
ENSMUST00000000028 NA Ensembl
ENSMUST00000000049 NA Ensembl
ENSMUST00000000058 Cav2 Ensembl

You can see that even the 2 NAs have Ensembl transcript records so why are they not getting annotated by AnnotationDbi?

The command also outputs this, which I'm not sure is relevant or something to worry about:

'select()' returned 1:many mapping between keys and columns


ADD COMMENTlink modified 8 days ago by James W. MacDonald41k • written 8 days ago by Mike0
gravatar for James W. MacDonald
8 days ago by
United States
James W. MacDonald41k wrote:

The package is based on NCBI annotations, whereas you are trying to annotate using Ensembl IDs. Trying to use one annotation source to annotate another is a recipe for heartache, for a number of reasons. You will be better off just using Ensembl databases to do the mapping. You can do that by either using the EnsDb packages that Johannes Rainier provides, or the biomaRt package:

> library(EnsDb.Mmusculus.v79)

> mapIds(EnsDb.Mmusculus.v79, ids, "SYMBOL","TXNAME")
ENSMUST00000000001 ENSMUST00000000028 ENSMUST00000000049 ENSMUST00000000058
           "Gnai3"            "Cdc45"             "Apoh"             "Cav2"

Or maybe more usefully

> select(EnsDb.Mmusculus.v79, ids, "SYMBOL","TXNAME")
              TXNAME SYMBOL               TXID
1 ENSMUST00000000001  Gnai3 ENSMUST00000000001
2 ENSMUST00000000028  Cdc45 ENSMUST00000000028
3 ENSMUST00000000049   Apoh ENSMUST00000000049
4 ENSMUST00000000058   Cav2 ENSMUST00000000058

Or using biomaRt

> library(biomaRt)
> mart <- useMart("ENSEMBL_MART_ENSEMBL", "mmusculus_gene_ensembl")
> getBM(c("mgi_symbol","ensembl_transcript_id"), "ensembl_transcript_id", ids, mart)
  mgi_symbol ensembl_transcript_id
1      Gnai3    ENSMUST00000000001
2      Cdc45    ENSMUST00000000028
3       Apoh    ENSMUST00000000049
4       Cav2    ENSMUST00000000058
ADD COMMENTlink written 8 days ago by James W. MacDonald41k

Thank you it's working much better now but still missing about 10% of them. Actually it's matching everything up to ENSMUST00000195885 and getting NA for all subsequent transcript IDs, here are the 10 around ENSMUST00000195885:

ENSMUST00000195877 RP24-144H23.5
ENSMUST00000195879 RP23-415F2.1
ENSMUST00000195880 RP24-429G21.6
ENSMUST00000195881 RP23-379F6.3
ENSMUST00000195885 RP24-336M14.2
ENSMUST00000195892 NA
ENSMUST00000195897 NA
ENSMUST00000195905 NA
ENSMUST00000195908 NA
ENSMUST00000195914 NA

Command is:

mapIds(EnsDb.Mmusculus.v79, keys=row.names(temp), column="SYMBOL", keytype="TXNAME", multiVals="first")

Also one of the results is now blank: ENSMUST00000077235

When using it correctly finds Dhrsx (Ensembl link).

ADD REPLYlink written 8 days ago by Mike0

If you want more recent transcripts, you need to use a more recent version of the Ensembl database. The version that Johannes provides is based on Ensembl V79 (hence the v79 in the name), which is rather old. Biomart is based on the current version:

> ids <- paste0("ENSMUST00000", c(195892, 195897, 195905, 195908, 195914))
> mart <- useMart("ENSEMBL_MART_ENSEMBL", "mmusculus_gene_ensembl")
> getBM(c("mgi_symbol","ensembl_transcript_id"), "ensembl_transcript_id", ids, mart)
  mgi_symbol ensembl_transcript_id
1     Gm9484    ENSMUST00000195892
2    Gm44357    ENSMUST00000195897
3      Frrs1    ENSMUST00000195905
4    Gm42630    ENSMUST00000195908
5    Gm43174    ENSMUST00000195914

And if we check an archived version 79

> mart2 <- useMart("ENSEMBL_MART_ENSEMBL", "mmusculus_gene_ensembl", "")
> getBM(c("mgi_symbol","ensembl_transcript_id"), "ensembl_transcript_id", ids, mart2)
[1] mgi_symbol            ensembl_transcript_id
<0 rows> (or 0-length row.names)
ADD REPLYlink written 8 days ago by James W. MacDonald41k

The reason for the missing entry might be that in Ensembl version 79 the transcript/gene was not annotated yet to that symbol. Locally I have EnsDb.Mmusculus.v87 and there it is annotated to DHRSX.


ADD REPLYlink written 8 days ago by Johannes Rainer650

You beat me by 3 minutes James ;)

Mike, if you need the new EnsDb just drop me a line.

cheers, jo

ADD REPLYlink written 8 days ago by Johannes Rainer650
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 128 users visited in the last hour