Dear Bioconductor community,
I recently acquired an RNA-Seq dataset from human samples and, for downstream analysis, I first tried to annotate the Ensembl IDs with gene symbols. Because GENCODE hg19 was used for the alignment, I queried the BioMart database through the biomaRt R package:
ensembl.genes <- as.character(rownames(dt.counts))
head(ensembl.genes)
[1] "ENSG00000223972" "ENSG00000227232" "ENSG00000243485" "ENSG00000237613"
[5] "ENSG00000268020" "ENSG00000240361"
ensembl.mart <- useMart(
  biomart = 'ENSEMBL_MART_ENSEMBL',
  host = 'grch37.ensembl.org',
  path = '/biomart/martservice',
  dataset = 'hsapiens_gene_ensembl')
annot <- getBM(
  attributes = c("hgnc_symbol", 'ensembl_gene_id', 'gene_biotype'),
  filters = 'ensembl_gene_id',
  values = ensembl.genes,
  mart = ensembl.mart)
final.annot <- annot %>% as_tibble() %>%
  filter(!is.na(hgnc_symbol)) %>%
  filter(gene_biotype == "protein_coding") %>%
  dplyr::distinct(hgnc_symbol, .keep_all = TRUE)
sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods
[9] base
other attached packages:
[1] org.Hs.eg.db_3.12.0 AnnotationDbi_1.52.0 IRanges_2.24.1
[4] S4Vectors_0.28.1 Biobase_2.50.0 BiocGenerics_0.36.1
[7] clusterProfiler_3.18.1 readxl_1.3.1 biomaRt_2.46.3
[10] data.table_1.14.0 vroom_1.4.0 forcats_0.5.1
[13] stringr_1.4.0 dplyr_1.0.5 purrr_0.3.4
[16] readr_1.4.0 tidyr_1.1.3 tibble_3.1.0
[19] ggplot2_3.3.3 tidyverse_1.3.1
However, when trying to replace the IDs with their respective gene symbols in the data matrix, I happened to notice the following issue in one specific row:
head(final.annot)
# A tibble: 6 x 3
hgnc_symbol ensembl_gene_id gene_biotype
<chr> <chr> <chr>
1 SCYL3 ENSG00000000457 protein_coding
2 C1orf112 ENSG00000000460 protein_coding
3 FGR ENSG00000000938 protein_coding
4 CFH ENSG00000000971 protein_coding
5 STPG1 ENSG00000001460 protein_coding
6 NIPAL3 ENSG00000001461 protein_coding
final.annot %>% as_tibble() %>% .[297,]
# A tibble: 1 x 3
hgnc_symbol ensembl_gene_id gene_biotype
<chr> <chr> <chr>
1 "" ENSG00000116883 protein_coding
So the symbol in this row does not appear to be an NA value, since I had already removed all NA symbols, but rather an empty character string. Is that possible, or usual?
In addition, as an alternative, "easier" way to annotate gene symbols, could I use the mapIds function directly, even without specifying any hg19 reference annotation?
Thank you in advance,
Efstathios
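The empty symbol in row 297 can be reproduced without running any query. In R an empty string is a valid, non-missing character value, so is.na() returns FALSE for it and the first filter() call lets it through. A minimal base R sketch, using a made-up toy vector (the values are illustrative assumptions, not taken from the real annotation table):

```r
# Toy vector of symbols: one ordinary symbol, one empty string, one NA.
# These values are illustrative assumptions, not actual query output.
symbols <- c("SCYL3", "", NA)

# is.na() flags only the true missing value, not the empty string.
na_flags <- is.na(symbols)

# Keep entries that are neither NA nor empty.
keep <- !is.na(symbols) & nzchar(symbols)
clean <- symbols[keep]

print(na_flags)
print(clean)
```

In the dplyr pipeline above, the equivalent would presumably be filter(!is.na(hgnc_symbol), hgnc_symbol != "").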
Dear James,
thank you very much for your valuable comment! Indeed, it seems something is not optimal with tidyverse, so I will check why this is happening.
One last question concerning the mapIds function: should it work similarly despite the hg19 annotation?

I don't think mapIds has a method for Mart objects. There is one for select, however.
And just to check
So, as before, this is something that you could have checked yourself.

Dear James,
thank you very much for your time! Many apologies if my question was not clear or correct. I have often used mapIds in the past, but with hg38 reference genomes in downstream RNA-Seq analyses, so I was wondering, just out of curiosity, whether using mapIds with an AnnotationDb object such as org.Hs.eg.db would be as accurate as biomaRt without specifying a certain reference annotation like hg19 or hg38.

You are conflating things that are not related. The genome builds have to do with the structure of the genome and the location of various things like genes, exons, binding sites, etc. This has nothing to do with the names that HUGO has defined for genes! The only relationship is simply temporal - if you use an old version of Ensembl, you by definition will get the gene names that were in use at that time, because you are using a static archived database. But this has nothing to do with the genome version.
Doing something like that is only useful if you are working with someone who has outdated gene symbols and cannot understand that these things change. So as an example,
The current HGNC Gene Symbol for this NCBI Gene ID is A1BG. There are, however, four other deprecated gene symbols that people have used to describe that gene in the past. Now maybe you know someone who insists that it's GAB or HYST2477 or whatever, in which case you really need the old gene symbol. But the current, officially designated (so far as HUGO is the arbiter of such things) gene symbol is A1BG, and that's the end of it.
So it makes no sense to say things like 'I am using hg38 to annotate my gene symbols' because that's not a thing. HUGO updates the gene symbols on an ongoing basis, without regard to the genome build.
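The point about deprecated symbols can be sketched with a toy lookup table. The table below is hypothetical and hand-written: it contains only A1BG and the two deprecated aliases named above, and the helper to_current() is an illustrative name, not a function from any package. A real lookup would come from an annotation resource. Note that nothing in the table refers to a genome build.

```r
# Hypothetical, hand-written alias table (not pulled from any database).
alias_table <- data.frame(
  alias = c("A1BG", "GAB", "HYST2477"),
  current_symbol = c("A1BG", "A1BG", "A1BG"),
  stringsAsFactors = FALSE
)

# Illustrative helper: map whatever symbol someone uses to the current symbol.
# Symbols absent from the table come back as NA.
to_current <- function(x, tbl = alias_table) {
  tbl$current_symbol[match(x, tbl$alias)]
}

updated <- to_current(c("GAB", "A1BG", "NOT_A_GENE"))
print(updated)
```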
As for using the org.Hs.eg.db package to annotate Ensembl Gene IDs, that is suboptimal. You are relying on multiple mapping steps (Ensembl Gene ID -> NCBI Gene ID -> HGNC Gene Symbol) to do that, and there are any number of reasons why that first mapping step will fail. NCBI and Ensembl use different criteria to define what is and isn't a gene, where it is located in the genome, and how many transcripts it has. They also have rules that they use to map their IDs to the other annotation service (so you can get instances where Ensembl says ENSG0000000XXX is NCBI Gene ID XXXX, but NCBI doesn't have the reciprocal mapping because their rules say they aren't the same thing, and vice versa).
Unless you have a real need to do cross-mapping between NCBI and Ensembl, don't do it. If you have Ensembl Gene IDs, use an Ensembl-based database to annotate rather than trying to use an NCBI-based database.
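How a two-step mapping can drop genes is easy to see with toy tables. Everything below is hypothetical: the IDs and symbols are placeholders rather than real genes, and the three data frames merely stand in for an Ensembl-to-NCBI table, an NCBI-to-symbol table, and an Ensembl-keyed symbol table. It is a sketch of the failure mode, not of how any particular package stores its mappings.

```r
# Hypothetical toy tables; IDs and symbols are placeholders, not real genes.
ens_to_ncbi <- data.frame(
  ensembl = c("ENSG_A", "ENSG_B"),   # ENSG_C has no NCBI counterpart here
  ncbi = c("101", "102"),
  stringsAsFactors = FALSE
)
ncbi_to_symbol <- data.frame(
  ncbi = c("101", "102"),
  symbol = c("GENE_A", "GENE_B"),
  stringsAsFactors = FALSE
)
ens_to_symbol <- data.frame(
  ensembl = c("ENSG_A", "ENSG_B", "ENSG_C"),
  symbol = c("GENE_A", "GENE_B", "GENE_C"),
  stringsAsFactors = FALSE
)

ids <- c("ENSG_A", "ENSG_B", "ENSG_C")

# Two-step route: Ensembl -> NCBI -> symbol.
step1 <- ens_to_ncbi$ncbi[match(ids, ens_to_ncbi$ensembl)]
two_step <- ncbi_to_symbol$symbol[match(step1, ncbi_to_symbol$ncbi)]

# Direct route through an Ensembl-keyed table.
direct <- ens_to_symbol$symbol[match(ids, ens_to_symbol$ensembl)]

print(data.frame(ids, two_step, direct))
```

In this toy case the third ID is lost on the two-step route only because the first table has no row for it, which mirrors the missing cross-mapping described above.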
Dear James, thank you once more for your comprehensive answer and explanation! Indeed, I naively confused gene symbol annotation with genome builds, which are unrelated. I will follow your advice and use either the initial search through biomaRt or the ensembldb package directly.
Best,
Efstathios