Question: How to map all Ensembl IDs to Gene Symbols - Problem with AnnotationDbi
Asked 14 months ago by gokce.ouz:

Hi,

I am analysing my RNA-Seq data with DESeq2. At the end I would like to convert the Ensembl IDs of the significantly expressed genes to gene symbols, and I am using AnnotationDbi for this. However, I realized that not all of the Ensembl IDs are converted: 25072 out of 48607 are returned as NA, and more than 11000 of these IDs are actually significantly differentially expressed. To double check, I submitted the IDs that got NA for the gene symbol to BioMart, and it converted them to gene symbols (as seen in the figure). So now I am confused: am I doing something wrong? Or is there an alternative way to extract all the gene symbols?

Thanks in advance,

Gokce 

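library("DESeq2")
library("AnnotationDbi")
library("org.Hs.eg.db")
# dds is the DESeqDataSet from the earlier DESeq() run; its rows are named by Ensembl gene ID
res <- results(dds, alpha=.05, contrast=c("Type", "Disease", "Control"))
# Look up one symbol per Ensembl ID; IDs absent from org.Hs.eg.db come back as NA
res$symbol <- mapIds(org.Hs.eg.db, keys=row.names(res), column="SYMBOL", keytype="ENSEMBL", multiVals="first")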

 

> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: CentOS release 6.5 (Final)

locale:
 [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                 
 [3] LC_TIME=en_US.UTF-8           LC_COLLATE=en_US.UTF-8       
 [5] LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8      
 [7] LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8          
 [9] LC_ADDRESS=en_US.UTF-8        LC_TELEPHONE=en_US.UTF-8     
[11] LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] edgeR_3.10.5              limma_3.24.15            
 [3] amap_0.8-14               sva_3.14.0               
 [5] mgcv_1.8-10               nlme_3.1-122             
 [7] doParallel_1.0.10         iterators_1.0.8          
 [9] foreach_1.4.3             reshape_0.8.5            
[11] cluster_2.0.3             matrixStats_0.50.1       
[13] flashClust_1.01-2         WGCNA_1.51               
[15] fastcluster_1.1.16        dynamicTreeCut_1.62      
[17] xlsx_0.5.7                xlsxjars_0.6.1           
[19] rJava_0.9-7               pheatmap_1.0.8           
[21] genefilter_1.50.0         gplots_2.17.0            
[23] RColorBrewer_1.1-2        vsn_3.36.0               
[25] org.Hs.eg.db_3.1.2        RSQLite_1.0.0            
[27] DBI_0.3.1                 DESeq2_1.8.2             
[29] RcppArmadillo_0.6.400.2.2 Rcpp_0.12.7              
[31] BiocParallel_1.2.22       GenomicAlignments_1.4.2  
[33] GenomicFeatures_1.20.6    AnnotationDbi_1.30.1     
[35] Biobase_2.28.0            Rsamtools_1.20.5         
[37] Biostrings_2.36.4         XVector_0.8.0            
[39] GenomicRanges_1.20.8      GenomeInfoDb_1.4.3       
[41] IRanges_2.2.9             S4Vectors_0.6.6          
[43] BiocGenerics_0.14.0       Hmisc_3.17-1             
[45] ggplot2_2.1.0             Formula_1.2-1            
[47] survival_2.38-3           lattice_0.20-33          
[49] BiocInstaller_1.20.3     

loaded via a namespace (and not attached):
 [1] splines_3.2.0         gtools_3.5.0          affy_1.46.1          
 [4] latticeExtra_0.6-26   impute_1.42.0         colorspace_1.2-6     
 [7] preprocessCore_1.30.0 Matrix_1.2-3          plyr_1.8.4           
[10] XML_3.98-1.3          biomaRt_2.24.1        zlibbioc_1.14.0      
[13] xtable_1.8-0          GO.db_3.1.2           scales_0.4.0         
[16] gdata_2.17.0          affyio_1.36.0         annotate_1.46.1      
[19] nnet_7.3-11           foreign_0.8-66        tools_3.2.0          
[22] munsell_0.4.3         locfit_1.5-9.1        lambda.r_1.1.7       
[25] caTools_1.17.1        futile.logger_1.4.1   grid_3.2.0           
[28] RCurl_1.95-4.7        bitops_1.0-6          gtable_0.2.0         
[31] codetools_0.2-14      gridExtra_2.0.0       rtracklayer_1.28.10  
[34] futile.options_1.0.0  KernSmooth_2.23-15    geneplotter_1.46.0   
[37] rpart_4.1-10          acepack_1.3-3.3      

 

Answer by Johannes Rainer (Italy), 14 months ago:

You could also use ensembldb to do the mapping between Ensembl gene IDs and gene names (or symbols). You would also need one of the EnsDb packages providing the actual annotation (such as EnsDb.Hsapiens.v75 for genome release GRCh37 or EnsDb.Hsapiens.v79 for GRCh38). Check the ensembldb vignette for more information (http://www.bioconductor.org/packages/release/bioc/vignettes/ensembldb/inst/doc/ensembldb.html).

You could basically use the same AnnotationDbi call that you use, but provide the EnsDb object instead of org.Hs.eg.db.
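
As a rough illustration of that swap, here is a minimal sketch. It assumes the ensembldb and EnsDb.Hsapiens.v79 packages are installed, and it uses "GENEID" as the key type for Ensembl gene IDs, since EnsDb objects do not use the "ENSEMBL" key type of the OrgDb packages. The three IDs are just examples; in the DESeq2 workflow the keys would be row.names(res).

```r
library(AnnotationDbi)
library(ensembldb)
library(EnsDb.Hsapiens.v79)

# Example Ensembl gene IDs (stand-ins for row.names(res))
ensemblGenes <- c("ENSG00000108958", "ENSG00000123009", "ENSG00000124399")

# Same mapIds() call as with org.Hs.eg.db, but with the EnsDb object
# and keytype "GENEID" instead of "ENSEMBL"
geneNames <- mapIds(EnsDb.Hsapiens.v79,
                    keys = ensemblGenes,
                    column = "SYMBOL",
                    keytype = "GENEID",
                    multiVals = "first")
geneNames
```

Running keytypes(EnsDb.Hsapiens.v79) and columns(EnsDb.Hsapiens.v79) lists what the installed annotation package actually supports.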

Just one clarification: the gene names that are listed above in your table are not gene symbols. They are rather the names for the genes that are provided by Ensembl. For protein-coding genes, however, the gene names correspond to the HGNC symbols.

Hope this helps.


Thanks a lot for the suggestion and clarification, Johannes. I will implement it in my analysis as soon as I solve my R version problem.

Actually, when I was running the code with org.Hs.eg.db, I was expecting to see all the corresponding HGNC symbols, but it did not return them, which surprised me. So when I run it using EnsDb.Hsapiens.v79, will it return gene names or gene symbols?

Best regards,

Gokce

Reply written 14 months ago by gokce.ouz

EnsDb.Hsapiens.v79 will return the gene names (even if you specify "SYMBOL"). I decided to go for the gene name in all cases, as that is species-independent.
 

Reply written 14 months ago by Johannes Rainer
Answer by Valerie Obenchain (United States), 13 months ago:

Hi,

The OrgDb packages are a collection of data from many different sources: NCBI, UCSC, Ensembl, etc. The packages are Entrez gene centric, in that we start with the list of Entrez gene ids from NCBI and annotate to that id. Data downloaded from Ensembl is matched to the Entrez gene id; if no mapping between the two exists, the Ensembl id doesn't end up in the OrgDb package.
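
One way to see this coverage gap before mapping is to compare your IDs against the ENSEMBL keys stored in the OrgDb. This is only a sketch, assuming org.Hs.eg.db is installed; `ids` is a hypothetical stand-in for row.names(res), and the counts will depend on the annotation version.

```r
library(AnnotationDbi)
library(org.Hs.eg.db)

# Hypothetical input: in practice this would be row.names(res)
ids <- c("ENSG00000108958", "ENSG00000123009", "ENSG00000124399")

# All Ensembl gene IDs that have an Entrez mapping in this OrgDb version
orgDbEnsembl <- keys(org.Hs.eg.db, keytype = "ENSEMBL")

# TRUE where the ID is present in the OrgDb, FALSE where mapIds() would return NA
inOrgDb <- ids %in% orgDbEnsembl
table(inOrgDb)

# The IDs that have no Entrez-centric annotation
ids[!inOrgDb]
```

Filtering on `inOrgDb` first also avoids the "None of the keys entered are valid" error that select() raises when every key is unknown.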

Taking the first 3 from your list as an example, 

ensemblGenes <- c("ENSG00000108958", "ENSG00000123009", "ENSG00000124399")
symbols <- c("AC016292.3", "NME2P1", "NDUFB4P12")

As you said, the OrgDb package doesn't have data for these Ensembl ids:

> select(org.Hs.eg.db,
+        key=ensemblGenes, columns=columns(org.Hs.eg.db),
+        keytype="ENSEMBL")
Error in .testForValidKeys(x, keys, keytype, fks) :
  None of the keys entered are valid keys for 'ENSEMBL'. Please use the keys method to see a listing of valid arguments.

Using the symbols instead, we see the OrgDb has data for some of these genes but no Ensembl -> Entrez id mapping:

> select(org.Hs.eg.db, key=symbols,
+        columns=c("ENTREZID", "ENSEMBL"),
+        keytype="SYMBOL")
'select()' returned 1:1 mapping between keys and columns
      SYMBOL ENTREZID ENSEMBL
1 AC016292.3     <NA>    <NA>
2     NME2P1   283458    <NA>
3  NDUFB4P12   402175    <NA>


biomaRt also shows no mapping between the two and is missing the symbol for the first gene:
library(biomaRt)
mart<- useDataset("hsapiens_gene_ensembl", useMart("ENSEMBL_MART_ENSEMBL"))

> getBM(filters="ensembl_gene_id",
+       attributes=c("ensembl_gene_id", "entrezgene",
+                    "hgnc_symbol"),
+       values=ensemblGenes,
+       mart=mart)
  ensembl_gene_id entrezgene hgnc_symbol
1 ENSG00000108958         NA           
2 ENSG00000123009         NA      NME2P1
3 ENSG00000124399         NA   NDUFB4P12
                                                                                   

Using Jo's EnsDb.Hsapiens.v79, it looks like the Ensembl id is called GENEID so we use that as 'keytype'. It also confirms no mapping between Ensembl and Entrez but it does return a value for all the symbols.

> select(EnsDb.Hsapiens.v79, key=ensemblGenes, 
+        columns=c("ENTREZID", "SYMBOL"), 
+        keytype="GENEID")
           GENEID ENTREZID       SYMBOL
1 ENSG00000108958            AC016292.3
2 ENSG00000123009                NME2P1
3 ENSG00000124399          RP11-663P9.2

So for the task of mapping Ensembl ids to gene symbols, it looks like Jo's ensembldb package is the most comprehensive.

Valerie
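
If HGNC symbols are preferred where they exist, the two kinds of result could be combined so that the Ensembl gene name is used only as a fallback. The following base R sketch uses the values from the outputs in this answer as toy data; `fillMissing` is a hypothetical helper written for this illustration, not a function from AnnotationDbi, biomaRt, or ensembldb.

```r
# Hypothetical helper: keep the primary annotation and fall back to the
# secondary one wherever the primary is NA or an empty string
fillMissing <- function(primary, fallback) {
  stopifnot(length(primary) == length(fallback))
  isMissing <- is.na(primary) | primary == ""
  primary[isMissing] <- fallback[isMissing]
  primary
}

# Toy data copied from the outputs in this answer:
# HGNC symbols as returned by biomaRt (empty for the first gene)
hgnc <- c(ENSG00000108958 = "",
          ENSG00000123009 = "NME2P1",
          ENSG00000124399 = "NDUFB4P12")
# Gene names as returned by EnsDb.Hsapiens.v79
ensdb <- c(ENSG00000108958 = "AC016292.3",
           ENSG00000123009 = "NME2P1",
           ENSG00000124399 = "RP11-663P9.2")

combined <- fillMissing(hgnc, ensdb)
combined
```

Both vectors are assumed to be in the same order and keyed by the same Ensembl IDs; with real results it would be safer to align them by name first.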
 


I really appreciate your detailed answer, Valerie.

Reply written 13 months ago by gokce.ouz