We have a workflow in my current institute that depends on the biomaRt package to get gene start positions for a given set of gene symbols. However, every now and then Ensembl connection fails and this causes a break in the pipeline. I would like to replace this step with the AnnotationHub method, so we can cache this information and don't rely on Ensembl anymore.
For some reason, for a given set of HGNC symbols, I get more hits from biomaRt compared to AnnotationHub. I am wondering how I can mimic the biomaRt steps with AnnotationHub, so I can get the same gene list back? I would also appreciate any suggestions to make my AnnotationHub code cleaner. Thanks in advance!
Here's a toy example:
Say, the following are my 100 gene symbols:
gene_names <- c("PHKA1P1", "RPL27P7", "RNA5SP210", "C21orf91", "PPP1R26P4", 
                "OR5BL1P", "LINC01348", "TPT1P13", "P3H4", "EGLN3P1", "ATP1B2", 
                "LINC00471", "GLT8D1", "DGKZP1", "RNU105C", "CDRT2", "GEMIN8P1", 
                "RNY5P10", "TOPBP1", "TRE-CTC6-1", "RPS12P5", "KIF3A", "RNA5SP259", 
                "YAP1P2", "DDX18P2", "ITCH", "TBC1D3C", "ERCC6L2", "USP2", 
                "TMEM108-AS1", "OR2A9P", "JAKMIP2", "MIR155HG", "MIR7844", 
                "MRPS15", "TMEM132A", "MYL6P1", "DCAF8", "SLC31A1P1", "PRDX4", 
                "FIZ1", "PSMD9", "KCNJ13", "RPL23AP33", "PIRC63", "ZC2HC1B", 
                "RPS25P6", "RPSAP42", "RNU7-89P", "KAT2B", "SP6", "SLC6A6", 
                "CLDN4", "TRBV3-2", "CSMD2-AS1", "MIR7112", "SENCR", "ARL2BPP1", 
                "RAB1A", "SNHG9", "POM121L7", "RNASE12", "CFAP74", "HHIP", "KRT8P6", 
                "TIFA", "LINC01349", "MIR3190", "KRTAP5-6", "HLTF", "MORC3", 
                "MIR1281", "RASGEF1A", "RBM26", "ZNF286A", "OR2A15P", "NHP2", 
                "MIR3910-1", "MRPL30", "TPP1", "ADAM5", "ZNF826P", "RNASE11", 
                "IGSF10", "NEK1", "KRTAP5-3", "VN1R112P", "OR51A3P", "CLCC1", 
                "RBMY2WP", "CT47A6", "LGALS14", "CBFB", "ANKRD49", "TRBV7-2", 
                "OSBPL2", "SLC9A9", "KLF15", "OPN1LW", "SCCPDH")
length(gene_names)
Here's how I get the gene start position with biomaRt, which returns 89 genes:
library(biomaRt)
ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl", host = "uswest.ensembl.org")
k <- getBM(attributes = c('hgnc_symbol', 'chromosome_name', 'start_position'), filters = 'hgnc_symbol', values=gene_names, mart=ensembl)
# some additional steps to normalize chr names
k$chromosome_name[k$chromosome_name == "X"] <- 23
k$chromosome_name[k$chromosome_name == "Y"] <- 24
k$chromosome_name <- as.numeric(as.character(k$chromosome_name))
k <- k[complete.cases(k),]
# check number of returned genes
nrow(k)
length(unique(k$hgnc_symbol))
Here's my attempt to create the same data.frame with AnnotationHub, which returns 64 genes:
library(AnnotationHub)
ah <- AnnotationHub()
bmrt <- query(ah, "BioMart")
txdb <- bmrt[["AH52256"]]
geneid <- bmrt[["AH69106"]]
colnames(geneid) <- c("entrezid", "hgnc_symbol")
seqlevelsStyle(txdb) <- "NCBI"
gr <- genes(txdb, columns=c("TXCHROM"), filter=list("gene_id"=paste0("GeneID:", geneid$entrezid[geneid$hgnc_symbol %in% gene_names])))
gr <- keepSeqlevels(gr, value = c(1:22, "X", "Y"))
genepos_df <- data.frame(chromosome_name=as.factor(as.character(mcols(gr)$TXCHROM)),
                start_position=start(gr),
                entrezid=gsub("GeneID:","", gr@ranges@NAMES))
genepos_df <- merge(geneid, genepos_df)
levels(genepos_df$chromosome_name) <- 1:24
genepos_df$chromosome_name <- as.character(genepos_df$chromosome_name)
genepos_df$entrezid <- NULL
nrow(genepos_df)
And my session info:
sessionInfo( )
R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.6
Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     
other attached packages:
 [1] GenomicFeatures_1.42.0 AnnotationDbi_1.52.0   Biobase_2.50.0         GenomicRanges_1.42.0   GenomeInfoDb_1.26.0    IRanges_2.24.0         S4Vectors_0.28.0      
 [8] AnnotationHub_2.22.0   BiocFileCache_1.14.0   dbplyr_1.4.4           BiocGenerics_0.36.0    biomaRt_2.46.0        
loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5                    lattice_0.20-41               prettyunits_1.1.1             Rsamtools_2.6.0               Biostrings_2.58.0            
 [6] assertthat_0.2.1              digest_0.6.25                 mime_0.9                      R6_2.4.1                      RSQLite_2.2.1                
[11] httr_1.4.2                    pillar_1.4.6                  zlibbioc_1.36.0               rlang_0.4.7                   progress_1.2.2               
[16] curl_4.3                      rstudioapi_0.11               blob_1.2.1                    Matrix_1.2-18                 BiocParallel_1.24.0          
[21] stringr_1.4.0                 RCurl_1.98-1.2                bit_4.0.4                     DelayedArray_0.16.0           shiny_1.5.0                  
[26] compiler_4.0.2                httpuv_1.5.4                  rtracklayer_1.50.0            pkgconfig_2.0.3               askpass_1.1                  
[31] htmltools_0.5.0               SummarizedExperiment_1.20.0   openssl_1.4.2                 tidyselect_1.1.0              tibble_3.0.3                 
[36] GenomeInfoDbData_1.2.4        interactiveDisplayBase_1.28.0 matrixStats_0.57.0            XML_3.99-0.5                  crayon_1.3.4                 
[41] dplyr_1.0.1                   later_1.1.0.1                 GenomicAlignments_1.26.0      bitops_1.0-6                  rappdirs_0.3.1               
[46] grid_4.0.2                    xtable_1.8-4                  lifecycle_0.2.0               DBI_1.1.0                     magrittr_1.5                 
[51] stringi_1.4.6                 XVector_0.30.0                promises_1.1.1                xml2_1.3.2                    ellipsis_0.3.1               
[56] generics_0.0.2                vctrs_0.3.2                   tools_4.0.2                   bit64_4.0.5                   glue_1.4.1                   
[61] purrr_0.3.4                   BiocVersion_3.12.0            MatrixGenerics_1.2.0          hms_0.5.3                     fastmap_1.0.1                
[66] yaml_2.2.1                    BiocManager_1.30.10           memoise_1.1.0
                    
                
                
Fantastic! I should have looked in more detail for Ensembl resources.
Also I would like to make a point here that, trying the following initially gave me an error:
Installing
ensmbldbresolved the issue: