Dear Bioconductor community,
Based on a recently acquired RNA-Seq experiment from human samples, I am trying to annotate Ensembl IDs with gene symbols for downstream analysis. Because GENCODE hg19 was used for the alignment, I first tried the BioMart database through the biomaRt R package:
    library(biomaRt)
    library(tidyverse)  # provides %>%, as_tibble(), filter(), distinct()

    # Ensembl gene IDs are the row names of the count matrix
    ensembl.genes <- as.character(rownames(dt.counts))
    head(ensembl.genes)
    "ENSG00000223972" "ENSG00000227232" "ENSG00000243485" "ENSG00000237613"
    "ENSG00000268020" "ENSG00000240361"

    # Connect to the GRCh37 (hg19) Ensembl archive
    ensembl.mart <- useMart(
      biomart = 'ENSEMBL_MART_ENSEMBL',
      host    = 'grch37.ensembl.org',
      path    = '/biomart/martservice',
      dataset = 'hsapiens_gene_ensembl')

    # Retrieve symbol and biotype for each Ensembl gene ID
    annot <- getBM(
      attributes = c("hgnc_symbol", 'ensembl_gene_id', 'gene_biotype'),
      filters    = 'ensembl_gene_id',
      values     = ensembl.genes,
      mart       = ensembl.mart)

    # Keep protein-coding genes with a non-NA symbol, one row per symbol
    final.annot <- annot %>%
      as_tibble() %>%
      filter(!is.na(hgnc_symbol)) %>%
      filter(gene_biotype == "protein_coding") %>%
      dplyr::distinct(hgnc_symbol, .keep_all = TRUE)

    sessionInfo()
    R version 4.0.3 (2020-10-10)
    Platform: x86_64-w64-mingw32/x64 (64-bit)
    Running under: Windows 10 x64 (build 18363)

    Matrix products: default

    locale:
    LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
    LC_MONETARY=English_United States.1252 LC_NUMERIC=C
    LC_TIME=English_United States.1252

    attached base packages:
    parallel  stats4  stats  graphics  grDevices  utils  datasets  methods
    base

    other attached packages:
    org.Hs.eg.db_3.12.0    AnnotationDbi_1.52.0  IRanges_2.24.1
    S4Vectors_0.28.1       Biobase_2.50.0        BiocGenerics_0.36.1
    clusterProfiler_3.18.1 readxl_1.3.1          biomaRt_2.46.3
    data.table_1.14.0      vroom_1.4.0           forcats_0.5.1
    stringr_1.4.0          dplyr_1.0.5           purrr_0.3.4
    readr_1.4.0            tidyr_1.1.3           tibble_3.1.0
    ggplot2_3.3.3          tidyverse_1.3.1
However, when replacing the IDs with their respective gene symbols in the data matrix, I happened to notice the following issue in one specific row:
    head(final.annot)
    # A tibble: 6 x 3
      hgnc_symbol ensembl_gene_id gene_biotype
      <chr>       <chr>           <chr>
    1 SCYL3       ENSG00000000457 protein_coding
    2 C1orf112    ENSG00000000460 protein_coding
    3 FGR         ENSG00000000938 protein_coding
    4 CFH         ENSG00000000971 protein_coding
    5 STPG1       ENSG00000001460 protein_coding
    6 NIPAL3      ENSG00000001461 protein_coding

    final.annot %>% as_tibble() %>% .[297,]
    # A tibble: 1 x 3
      hgnc_symbol ensembl_gene_id gene_biotype
      <chr>       <chr>           <chr>
    1 ""          ENSG00000116883 protein_coding
The symbol in that row does not appear to be an NA value, since I had already removed all NA symbols; instead it looks like an empty string. Is that possible, or even usual?
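To make the distinction concrete, here is a minimal base-R sketch on a toy data frame (not the real BioMart result). The first two rows copy values from the output above; the third row, with an NA symbol and a placeholder ID, is purely hypothetical and only there for contrast. It suggests why an `is.na()` filter alone would let an empty string through, and one possible way to drop both cases:

```r
# Toy stand-in for the getBM() result; row 3 is hypothetical
annot <- data.frame(
  hgnc_symbol     = c("SCYL3", "", NA),
  ensembl_gene_id = c("ENSG00000000457", "ENSG00000116883", "ENSG_HYPOTHETICAL"),
  gene_biotype    = "protein_coding",
  stringsAsFactors = FALSE
)

# An empty string is a valid, non-missing character value
na.flag <- is.na(annot$hgnc_symbol)

# Keep only rows whose symbol is neither NA nor empty
keep <- !is.na(annot$hgnc_symbol) & annot$hgnc_symbol != ""
annot.clean <- annot[keep, ]

print(na.flag)
print(annot.clean)
```

With dplyr, the same condition could presumably be written inside a single `filter()` call in place of the `is.na()` check alone.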
In addition, as an alternative, simpler way of annotating gene symbols, could I use the mapIds function directly, even without specifying any hg19 reference annotation?
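What I have in mind is roughly the following sketch, assuming the org.Hs.eg.db package listed in my session is the annotation source. The three IDs are taken from the output above and stand in for the full `ensembl.genes` vector; the choice of `multiVals = "first"` is my own assumption about how to handle IDs that map to several symbols, and I have not verified how well this package's mappings agree with the hg19 GENCODE annotation:

```r
library(AnnotationDbi)
library(org.Hs.eg.db)

# A few IDs from the output above, standing in for the full vector
ensembl.genes <- c("ENSG00000000457", "ENSG00000000460", "ENSG00000116883")

# Map Ensembl gene IDs to gene symbols; the result is a named character
# vector with one entry per key, in the same order as the keys
symbols <- mapIds(
  org.Hs.eg.db,
  keys      = ensembl.genes,
  column    = "SYMBOL",
  keytype   = "ENSEMBL",
  multiVals = "first"  # assumption: take the first symbol when several match
)

print(symbols)
```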
Thank you in advance,