Question

Weird output of biomaRt R package querying Ensembl IDs (hg19/grch37) to retrieve gene symbols

0

Entering edit mode

svlachavas ▴ 830

@svlachavas-7225

Last seen 9 months ago

Germany/Heidelberg/German Cancer Resear…

Dear Bioconductor community,

based on a recently acquired RNASeq data experiment from human samples, for downstream data analysis I tried initially to annotate the Ensembl ids to gene symbols; due to the fact for the respective alignment gencode hg19 was used, I initially tried to utilize the BioMart database through the respective R package:


ensembl.genes <- as.character(rownames(dt.counts)) 

head(ensembl.genes)
[1] "ENSG00000223972" "ENSG00000227232" "ENSG00000243485" "ENSG00000237613"
[5] "ENSG00000268020" "ENSG00000240361"

ensembl.mart <- useMart(
  biomart = 'ENSEMBL_MART_ENSEMBL', 
  host    = 'grch37.ensembl.org',
  path    = '/biomart/martservice',
  dataset = 'hsapiens_gene_ensembl')


annot <- getBM(
  attributes = 
  c("hgnc_symbol",'ensembl_gene_id','gene_biotype'),
  filters = 'ensembl_gene_id',
  values = ensembl.genes,
  mart = ensembl.mart)


final.annot <- annot %>% as_tibble() %>%
filter(!is.na(hgnc_symbol)) %>% 
filter(gene_biotype == "protein_coding") %>%
dplyr::distinct(hgnc_symbol,.keep_all=TRUE)

sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods  
[9] base     

other attached packages:
 [1] org.Hs.eg.db_3.12.0    AnnotationDbi_1.52.0   IRanges_2.24.1        
 [4] S4Vectors_0.28.1       Biobase_2.50.0         BiocGenerics_0.36.1   
 [7] clusterProfiler_3.18.1 readxl_1.3.1           biomaRt_2.46.3        
[10] data.table_1.14.0      vroom_1.4.0            forcats_0.5.1         
[13] stringr_1.4.0          dplyr_1.0.5            purrr_0.3.4           
[16] readr_1.4.0            tidyr_1.1.3            tibble_3.1.0          
[19] ggplot2_3.3.3          tidyverse_1.3.1

However, accidentally when trying to replace the IDs with their respective gene symbols in the data matrix, I noticed the following issue in a specific line:


head(final.annot)
# A tibble: 6 x 3
  hgnc_symbol ensembl_gene_id gene_biotype  
  <chr>       <chr>           <chr>         
1 SCYL3       ENSG00000000457 protein_coding
2 C1orf112    ENSG00000000460 protein_coding
3 FGR         ENSG00000000938 protein_coding
4 CFH         ENSG00000000971 protein_coding
5 STPG1       ENSG00000001460 protein_coding
6 NIPAL3      ENSG00000001461 protein_coding

final.annot %>% as_tibble() %>% .[297,]
# A tibble: 1 x 3
  hgnc_symbol ensembl_gene_id gene_biotype  
  <chr>       <chr>           <chr>         
1 ""          ENSG00000116883 protein_coding

Thus, in the above line it does not seem like a "NA" value, as previously removed any NA symbols, but something as an empty character-is that possible or usual ?

In addition, for an alternative more "easy" way of annotation to gene symbols, could I use directly the mapIds function, even without specifying any hg19 reference annotation ?

Thank you in advance,

Efstathios

biomaRt annnotation RNASeq AnnotationDbi • 2.5k views

ADD COMMENT • link 3.2 years ago svlachavas ▴ 830

score 3 · Answer 1 · 2021-05-19

3

Entering edit mode

James W. MacDonald 66k

@james-w-macdonald-5106

Last seen 1 hour ago

United States

I don't know anything about tidyverse stuff, so can't speak to that, but it's simple to check things directly.

> library(biomaRt)
> ensembl.mart <- useMart(
 +   biomart = 'ENSEMBL_MART_ENSEMBL', 
+   host    = 'grch37.ensembl.org',
+   path    = '/biomart/martservice',
+   dataset = 'hsapiens_gene_ensembl')

> getBM(c("hgnc_symbol","ensembl_gene_id","gene_biotype"), "ensembl_gene_id", "ENSG00000116883", ensembl.mart)
  hgnc_symbol ensembl_gene_id   gene_biotype
1          NA ENSG00000116883 protein_coding

Which seems to answer the question.

ADD COMMENT • link 3.2 years ago James W. MacDonald 66k

0

Entering edit mode

Dear James,

thank you very much for your valuable comment !! Indeed it seems something is not optimal with tidyverse, so I would check why this is happening;

one last question-concerning mapIds function, it should work similarly despite hg19 annotation ?

ADD REPLY • link 3.2 years ago svlachavas ▴ 830

0

Entering edit mode

I don't think mapIds has a method for Mart objects. There is one for select however.

> showMethods(select)
Function: select (package AnnotationDbi)
x="ChipDb"
x="GODb"
x="Inparanoid8Db"
x="InparanoidDb"
x="Mart"
x="OrgDb"
x="ReactomeDb"

> selectMethod(select, "Mart")
Method Definition:

function (x, keys, columns, keytype, ...) 
{
    if (missing(columns)) 
        stop("Argument 'columns' must be specified.")
    if (!is.list(keytype) && keytype != "" && missing(keys)) 
        stop("Argument 'keys' must be specified.")
    if (length(keytype) > 0 && length(keys) == 0) 
        stop("Keys argument contains no data.")
    if (!(is.character(keytype)) || length(keytype) != 1) {
        stop("keytype should be single element character vector.")
    }
    getBM(attributes = columns, filters = keytype, values = keys, 
        mart = x)
}
<bytecode: 0x0000000020887f00>
<environment: namespace:biomaRt>

Signatures:
        x     
target  "Mart"
defined "Mart"

> showMethods(mapIds)
Function: mapIds (package AnnotationDbi)
x="AnnotationDb"

And just to check

> mart <- useEnsembl("ensembl","hsapiens_gene_ensembl")
> select(mart, "ENSG00000000457", "hgnc_symbol", "ensembl_gene_id")
  hgnc_symbol
1       SCYL3
> mapIds(mart, "ENSG00000000457", "hgnc_symbol", "ensembl_gene_id")
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function 'mapIds' for signature '"Mart"'

So, as before, this is something that you could have checked yourself.

ADD REPLY • link 3.2 years ago James W. MacDonald 66k

0

Entering edit mode

Dear James,

thank you very much for your time spent !! Many apologies if I was not clear or correct about my intended question-as I have used a lot in the past mapIds but with hg38 reference genomes in downstream RNASeq analyses, I was wondering if using mapIds with an AnnotationDb object such as org.Hs.eg.db would be similarly accurate with biomart without specifying a certain reference annotation like hg19 or hg38 -just for curiosity:


gene.ids <- mapIds(org.Hs.eg.db, keys=ensembl.genes,
+                    keytype="ENSEMBL", column="SYMBOL")
'select()' returned 1:many mapping between keys and columns
head(gene.ids)
ENSG00000223972 ENSG00000227232 ENSG00000243485 ENSG00000237613 ENSG00000268020 
      "DDX11L2"        "WASH7P"              NA       "FAM138A"        "OR4G4P" 
ENSG00000240361 
             NA 
dt <- data.frame(ENSEMBL=ensembl.genes, SYMBOL=gene.ids)
head(dt)
                        ENSEMBL  SYMBOL
ENSG00000223972 ENSG00000223972 DDX11L2
ENSG00000227232 ENSG00000227232  WASH7P
ENSG00000243485 ENSG00000243485    <NA>
ENSG00000237613 ENSG00000237613 FAM138A
ENSG00000268020 ENSG00000268020  OR4G4P
ENSG00000240361 ENSG00000240361    <NA>

dt<- dt[!duplicated(dt$SYMBOL),]
dt <- dt[!is.na(dt$SYMBOL),]
head(dt)
                        ENSEMBL       SYMBOL
ENSG00000223972 ENSG00000223972      DDX11L2
ENSG00000227232 ENSG00000227232       WASH7P
ENSG00000237613 ENSG00000237613      FAM138A
ENSG00000268020 ENSG00000268020       OR4G4P
ENSG00000186092 ENSG00000186092        OR4F5
ENSG00000238009 ENSG00000238009 LOC100996442
c.ids <- intersect(final.annot$hgnc_symbol,dt$SYMBOL)

length(c.ids)
[1] 17709
dim(final.annot)
[1] 19215     3
dim(dt)
[1] 32934     2

ADD REPLY • link 3.2 years ago svlachavas ▴ 830

3

Entering edit mode

You are conflating things that are not related. The genome builds have to do with the structure of the genome and the location of various things like genes, exons, binding sites, etc. This has nothing to do with the names that HUGO has defined for genes! The only relationship is simply temporal - if you use an old version of Ensembl, you by definition will get the gene names that were in use at that time, because you are using a static archived database. But this has nothing to do with the genome version.

Doing something like that is only useful if you are working with someone who has outdated gene symbols and cannot understand that these things change. So as an example,

> select(org.Hs.eg.db, "1", "ALIAS")
'select()' returned 1:many mapping between keys and columns
  ENTREZID    ALIAS
1        1      A1B
2        1      ABG
3        1      GAB
4        1 HYST2477
5        1     A1BG

The current HGNC Gene Symbol for this NCBI Gene ID is A1BG. There are, however, four other deprecated gene symbols that people have used to describe that gene in the past. Now maybe you know someone who insists that it's GAB or HYST2477 or whatever, in which case you really need the old gene symbol. But the current, officially designated (so far as HUGO is the arbiter of such things) gene symbol is A1BG, and that's the end of it.

So it makes no sense to say things like 'I am using hg38 to annotate my gene symbols' because that's not a thing. HUGO updates the gene symbols on an ongoing basis, without regard to the genome build.

As for using the org.Hs.eg.db package to annotate Ensembl Gene IDs, that is suboptimal. You are relying on multiple mapping steps (Ensembl Gene ID -> NCBI Gene ID -> HGNC Gene Symbol) to do that, and there are any number of reasons why that first mapping step will fail. NCBI and Ensembl use different criteria to define what is and isn't a gene, where it is located in the genome, and how many transcripts it has. They also have rules that they use to map their IDs to the other annotation service (so you can get instances where Ensembl says ENSG0000000XXX is NCBI Gene ID XXXX, but NCBI doesn't have the reciprocal mapping because their rules say they aren't the same thing, and vice versa).

Unless you have a real need to do cross-mapping between NCBI and Ensembl, don't do it. If you have Ensembl Gene IDs, use an Ensembl-based database to annotate rather than trying to use an NCBI based database.

ADD REPLY • link 3.2 years ago James W. MacDonald 66k

0

Entering edit mode

Dear James, thank you one more time for your comprehensive answer and explanation !! Indeed very naively I was confused with gene symbol annotation and irrelevant genome builds; I will follow your advice and use either the initial search through bioMart, or directly through the ensembldb package;

Best,

Efstathios

ADD REPLY • link 3.2 years ago svlachavas ▴ 830