Annotating DEseq output using AnnotationDbi mapIds and most results say NA
Mike ▴ 10

I have a list of ~31,000 mouse transcripts with their Ensembl transcript IDs that I'm trying to annotate using AnnotationDbi and the database. I'm using R v3.3.2, and the object containing the IDs is called "temp". My command is:

mapIds(, keys=row.names(temp), keytype="ENSEMBLTRANS", column="SYMBOL", multiVals="first")

Only ~8500 get annotated with a gene name/symbol while the rest get "NA". If I search some of the NAs on Ensembl they match to transcripts/genes correctly. Some examples:

Transcript ID from "temp"   Result from mapIds   Link to Ensembl record
ENSMUST00000000001          Gnai3                Ensembl
ENSMUST00000000028          NA                   Ensembl
ENSMUST00000000049          NA                   Ensembl
ENSMUST00000000058          Cav2                 Ensembl

You can see that even the two NAs have Ensembl transcript records, so why are they not being annotated by AnnotationDbi?

The command also outputs this, which I'm not sure is relevant or something to worry about:

'select()' returned 1:many mapping between keys and columns
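
Note on the message above: it is informational rather than an error. It means at least one key matched more than one value in the requested column, and with multiVals="first" only the first match per key is kept. A minimal base-R sketch of that behaviour on a toy table (this is not AnnotationDbi's internal code, and the second symbol for ENSMUST00000000001 is made up purely to create a 1:many case):

```r
# Toy long-format mapping table. "HypotheticalAlias" is an invented value,
# included only so that one key has two matches (a 1:many mapping).
map <- data.frame(
  tx     = c("ENSMUST00000000001", "ENSMUST00000000001", "ENSMUST00000000058"),
  symbol = c("Gnai3", "HypotheticalAlias", "Cav2"),
  stringsAsFactors = FALSE
)

keys <- c("ENSMUST00000000001", "ENSMUST00000000028", "ENSMUST00000000058")

# match() returns the position of the first hit for each key (NA if absent),
# which mimics the effect of keeping only the first match per key.
first_hit <- setNames(map$symbol[match(keys, map$tx)], keys)

# Count how many keys were left unmapped.
n_unmapped <- sum(is.na(first_hit))
```

Counting the NAs this way (sum(is.na(...)) on the mapIds result) is also a quick way to quantify how many IDs a given annotation source fails to cover.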


AnnotationDbi mapIds ENSEMBLTRANS • 2.7k views

The package is based on NCBI annotations, whereas you are trying to annotate using Ensembl IDs. Trying to use one annotation source to annotate another is a recipe for heartache, for a number of reasons. You will be better off just using Ensembl databases to do the mapping. You can do that by using either the EnsDb packages that Johannes Rainer provides or the biomaRt package:

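> library(EnsDb.Mmusculus.v79)
> ids <- c("ENSMUST00000000001", "ENSMUST00000000028", "ENSMUST00000000049", "ENSMUST00000000058")
> mapIds(EnsDb.Mmusculus.v79, ids, "SYMBOL", "TXNAME")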
ENSMUST00000000001 ENSMUST00000000028 ENSMUST00000000049 ENSMUST00000000058
           "Gnai3"            "Cdc45"             "Apoh"             "Cav2"

Or maybe more usefully

> select(EnsDb.Mmusculus.v79, ids, "SYMBOL","TXNAME")
              TXNAME SYMBOL               TXID
1 ENSMUST00000000001  Gnai3 ENSMUST00000000001
2 ENSMUST00000000028  Cdc45 ENSMUST00000000028
3 ENSMUST00000000049   Apoh ENSMUST00000000049
4 ENSMUST00000000058   Cav2 ENSMUST00000000058

Or using biomaRt

> library(biomaRt)
> mart <- useMart("ENSEMBL_MART_ENSEMBL", "mmusculus_gene_ensembl")
> getBM(c("mgi_symbol","ensembl_transcript_id"), "ensembl_transcript_id", ids, mart)
  mgi_symbol ensembl_transcript_id
1      Gnai3    ENSMUST00000000001
2      Cdc45    ENSMUST00000000028
3       Apoh    ENSMUST00000000049
4       Cav2    ENSMUST00000000058
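
To get the symbols back onto the results table, it is probably safest to join by ID rather than assume the rows line up, since getBM returns only the IDs it found and may not preserve the input order. A minimal base-R sketch, assuming a results table called temp with transcript IDs as row names; the log2FoldChange values are made up, and annot stands in for the getBM output shown above:

```r
# Toy stand-in for the results table; the fold-change values are invented.
temp <- data.frame(
  log2FoldChange = c(0.5, -1.2, 2.0, 0.1),
  row.names = c("ENSMUST00000000058", "ENSMUST00000000001",
                "ENSMUST00000000049", "ENSMUST00000000028")
)

# Stand-in for the data frame returned by getBM (values from the output above).
annot <- data.frame(
  mgi_symbol = c("Gnai3", "Cdc45", "Apoh", "Cav2"),
  ensembl_transcript_id = c("ENSMUST00000000001", "ENSMUST00000000028",
                            "ENSMUST00000000049", "ENSMUST00000000058"),
  stringsAsFactors = FALSE
)

# Look up each row name in the annotation; IDs with no match become NA.
temp$symbol <- annot$mgi_symbol[match(rownames(temp), annot$ensembl_transcript_id)]
```

Using match() keeps temp in its original order and leaves NA for any transcript the annotation does not cover.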

Thank you, it's working much better now, but it's still missing about 10% of them. Actually, it's matching everything up to ENSMUST00000195885 and returning NA for all subsequent transcript IDs. Here are the 10 around ENSMUST00000195885:

ENSMUST00000195877 RP24-144H23.5
ENSMUST00000195879 RP23-415F2.1
ENSMUST00000195880 RP24-429G21.6
ENSMUST00000195881 RP23-379F6.3
ENSMUST00000195885 RP24-336M14.2
ENSMUST00000195892 NA
ENSMUST00000195897 NA
ENSMUST00000195905 NA
ENSMUST00000195908 NA
ENSMUST00000195914 NA

Command is:

mapIds(EnsDb.Mmusculus.v79, keys=row.names(temp), column="SYMBOL", keytype="TXNAME", multiVals="first")

Also one of the results is now blank: ENSMUST00000077235

When using it correctly finds Dhrsx (Ensembl link).


If you want more recent transcripts, you need to use a more recent version of the Ensembl database. The version that Johannes provides is based on Ensembl V79 (hence the v79 in the name), which is rather old. Biomart is based on the current version:

> ids <- paste0("ENSMUST00000", c(195892, 195897, 195905, 195908, 195914))
> mart <- useMart("ENSEMBL_MART_ENSEMBL", "mmusculus_gene_ensembl")
> getBM(c("mgi_symbol","ensembl_transcript_id"), "ensembl_transcript_id", ids, mart)
  mgi_symbol ensembl_transcript_id
1     Gm9484    ENSMUST00000195892
2    Gm44357    ENSMUST00000195897
3      Frrs1    ENSMUST00000195905
4    Gm42630    ENSMUST00000195908
5    Gm43174    ENSMUST00000195914

And if we check an archived version 79

> mart2 <- useMart("ENSEMBL_MART_ENSEMBL", "mmusculus_gene_ensembl", "")
> getBM(c("mgi_symbol","ensembl_transcript_id"), "ensembl_transcript_id", ids, mart2)
[1] mgi_symbol            ensembl_transcript_id
<0 rows> (or 0-length row.names)
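
The pattern Mike describes, where every ID above some cutoff is unmapped, is what one would expect when the annotation release predates the newer transcript IDs. A minimal base-R sketch for checking this, using the ten results he posted; it assumes, as this thread suggests, that more recently added Ensembl transcript IDs carry higher numbers:

```r
# Results copied from the post above: names are transcript IDs, values are symbols.
res <- c(
  ENSMUST00000195877 = "RP24-144H23.5",
  ENSMUST00000195879 = "RP23-415F2.1",
  ENSMUST00000195880 = "RP24-429G21.6",
  ENSMUST00000195881 = "RP23-379F6.3",
  ENSMUST00000195885 = "RP24-336M14.2",
  ENSMUST00000195892 = NA,
  ENSMUST00000195897 = NA,
  ENSMUST00000195905 = NA,
  ENSMUST00000195908 = NA,
  ENSMUST00000195914 = NA
)

# Strip the prefix and convert the remainder of each ID to a number.
id_num <- as.numeric(sub("^ENSMUST", "", names(res)))

# Highest ID that was mapped and lowest ID that was not.
last_mapped    <- max(id_num[!is.na(res)])
first_unmapped <- min(id_num[is.na(res)])

# TRUE when all unmapped IDs sit above all mapped ones.
clean_split <- first_unmapped > last_mapped
```

A clean split like this points to a version mismatch rather than a problem with individual transcripts.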

The reason for the missing entry might be that in Ensembl version 79 the transcript/gene was not annotated yet to that symbol. Locally I have EnsDb.Mmusculus.v87 and there it is annotated to DHRSX.



You beat me by 3 minutes James ;)

Mike, if you need the new EnsDb just drop me a line.

cheers, jo


Hi Johannes,

Any chance you can make EnsDb.Mmusculus.v87 available through the Bioconductor annotation pages? Or any other way? For my data set I (also) would like to make use of the latest annotation info available. :)




Actually, with the current development version it would be possible to get EnsDb for all species from Ensembl 87 from AnnotationHub

> library(AnnotationHub)
> ah <- AnnotationHub()
> query(ah, c("EnsDb", "mus musculus", "87"))
AnnotationHub with 1 record
# snapshotDate(): 2017-02-07
# names(): AH53222
# $dataprovider: Ensembl
# $species: Mus Musculus
# $rdataclass: EnsDb
# $title: Ensembl 87 EnsDb for Mus Musculus
# $description: Gene and protein annotations for Mus Musculus based on Ensem...
# $taxonomyid: 10090
# $genome: GRCm38
# $sourcetype: ensembl
# $sourceurl:
# $sourcelastmodifieddate: NA
# $sourcesize: NA
# $tags: c("EnsDb", "Ensembl", "Gene", "Transcript", "Protein",
#   "Annotation", "87", "AHEnsDbs")
# retrieve record with 'object[["AH53222"]]'

As I said, that's in the development version of Bioconductor (3.5), so it is not yet officially available.

In the meantime you can download the corresponding SQLite file from - but beware - the download will be slow. You can then load the corresponding EnsDb by calling the EnsDb function with the full path to the SQLite file as its argument.
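
Both routes described above could be wrapped roughly as follows. This is only a sketch: load_ensdb is a hypothetical helper name, it assumes the AnnotationHub and ensembldb packages are installed, and it assumes the record ID AH53222 from the query output above is still served by the hub in your Bioconductor version.

```r
# Hypothetical helper: open an EnsDb either from a local SQLite file or
# from AnnotationHub. Nothing is downloaded until the function is called.
load_ensdb <- function(sqlite_path = NULL, ah_id = "AH53222") {
  if (!is.null(sqlite_path)) {
    # EnsDb() takes the full path to a previously downloaded SQLite file.
    return(ensembldb::EnsDb(sqlite_path))
  }
  # Otherwise fetch the record from AnnotationHub by its ID.
  ah <- AnnotationHub::AnnotationHub()
  ah[[ah_id]]
}

# Usage (not run here; needs the packages and a network connection or file):
# edb <- load_ensdb()
# ensembldb::ensemblVersion(edb)   # check which Ensembl release this is
# AnnotationDbi::mapIds(edb, keys = row.names(temp), column = "SYMBOL",
#                       keytype = "TXNAME", multiVals = "first")
```

Checking ensemblVersion() before mapping makes it easier to tell whether remaining NAs come from an outdated release.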


Thanks! Will go for the 1st option!

