We have a workflow in my current institute that depends on the biomaRt
package to get gene start positions for a given set of gene symbols. However, every now and then Ensembl connection fails and this causes a break in the pipeline. I would like to replace this step with the AnnotationHub
method, so we can cache this information and don't rely on Ensembl anymore.
For some reason, for a given set of HGNC symbols, I get more hits from biomaRt
compared to AnnotationHub
. I am wondering how I can mimic the biomaRt
steps with AnnotationHub
, so I can get the same gene list back? I would also appreciate any suggestions to make my AnnotationHub
code cleaner. Thanks in advance!
Here's a toy example:
Say, the following are my 100 gene symbols:
gene_names <- c("PHKA1P1", "RPL27P7", "RNA5SP210", "C21orf91", "PPP1R26P4",
"OR5BL1P", "LINC01348", "TPT1P13", "P3H4", "EGLN3P1", "ATP1B2",
"LINC00471", "GLT8D1", "DGKZP1", "RNU105C", "CDRT2", "GEMIN8P1",
"RNY5P10", "TOPBP1", "TRE-CTC6-1", "RPS12P5", "KIF3A", "RNA5SP259",
"YAP1P2", "DDX18P2", "ITCH", "TBC1D3C", "ERCC6L2", "USP2",
"TMEM108-AS1", "OR2A9P", "JAKMIP2", "MIR155HG", "MIR7844",
"MRPS15", "TMEM132A", "MYL6P1", "DCAF8", "SLC31A1P1", "PRDX4",
"FIZ1", "PSMD9", "KCNJ13", "RPL23AP33", "PIRC63", "ZC2HC1B",
"RPS25P6", "RPSAP42", "RNU7-89P", "KAT2B", "SP6", "SLC6A6",
"CLDN4", "TRBV3-2", "CSMD2-AS1", "MIR7112", "SENCR", "ARL2BPP1",
"RAB1A", "SNHG9", "POM121L7", "RNASE12", "CFAP74", "HHIP", "KRT8P6",
"TIFA", "LINC01349", "MIR3190", "KRTAP5-6", "HLTF", "MORC3",
"MIR1281", "RASGEF1A", "RBM26", "ZNF286A", "OR2A15P", "NHP2",
"MIR3910-1", "MRPL30", "TPP1", "ADAM5", "ZNF826P", "RNASE11",
"IGSF10", "NEK1", "KRTAP5-3", "VN1R112P", "OR51A3P", "CLCC1",
"RBMY2WP", "CT47A6", "LGALS14", "CBFB", "ANKRD49", "TRBV7-2",
"OSBPL2", "SLC9A9", "KLF15", "OPN1LW", "SCCPDH")
length(gene_names)
Here's how I get the gene start position with biomaRt
, which returns 89 genes:
library(biomaRt)
ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl", host = "uswest.ensembl.org")
k <- getBM(attributes = c('hgnc_symbol', 'chromosome_name', 'start_position'), filters = 'hgnc_symbol', values=gene_names, mart=ensembl)
# some additional steps to normalize chr names
k$chromosome_name[k$chromosome_name == "X"] <- 23
k$chromosome_name[k$chromosome_name == "Y"] <- 24
k$chromosome_name <- as.numeric(as.character(k$chromosome_name))
k <- k[complete.cases(k),]
# check number of returned genes
nrow(k)
length(unique(k$hgnc_symbol))
Here's my attempt to create the same data.frame with AnnotationHub
, which returns 64 genes:
library(AnnotationHub)
ah <- AnnotationHub()
bmrt <- query(ah, "BioMart")
txdb <- bmrt[["AH52256"]]
geneid <- bmrt[["AH69106"]]
colnames(geneid) <- c("entrezid", "hgnc_symbol")
seqlevelsStyle(txdb) <- "NCBI"
gr <- genes(txdb, columns=c("TXCHROM"), filter=list("gene_id"=paste0("GeneID:", geneid$entrezid[geneid$hgnc_symbol %in% gene_names])))
gr <- keepSeqlevels(gr, value = c(1:22, "X", "Y"))
genepos_df <- data.frame(chromosome_name=as.factor(as.character(mcols(gr)$TXCHROM)),
start_position=start(gr),
entrezid=gsub("GeneID:","", gr@ranges@NAMES))
genepos_df <- merge(geneid, genepos_df)
levels(genepos_df$chromosome_name) <- 1:24
genepos_df$chromosome_name <- as.character(genepos_df$chromosome_name)
genepos_df$entrezid <- NULL
nrow(genepos_df)
And my session info:
sessionInfo( )
R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] GenomicFeatures_1.42.0 AnnotationDbi_1.52.0 Biobase_2.50.0 GenomicRanges_1.42.0 GenomeInfoDb_1.26.0 IRanges_2.24.0 S4Vectors_0.28.0
[8] AnnotationHub_2.22.0 BiocFileCache_1.14.0 dbplyr_1.4.4 BiocGenerics_0.36.0 biomaRt_2.46.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.5 lattice_0.20-41 prettyunits_1.1.1 Rsamtools_2.6.0 Biostrings_2.58.0
[6] assertthat_0.2.1 digest_0.6.25 mime_0.9 R6_2.4.1 RSQLite_2.2.1
[11] httr_1.4.2 pillar_1.4.6 zlibbioc_1.36.0 rlang_0.4.7 progress_1.2.2
[16] curl_4.3 rstudioapi_0.11 blob_1.2.1 Matrix_1.2-18 BiocParallel_1.24.0
[21] stringr_1.4.0 RCurl_1.98-1.2 bit_4.0.4 DelayedArray_0.16.0 shiny_1.5.0
[26] compiler_4.0.2 httpuv_1.5.4 rtracklayer_1.50.0 pkgconfig_2.0.3 askpass_1.1
[31] htmltools_0.5.0 SummarizedExperiment_1.20.0 openssl_1.4.2 tidyselect_1.1.0 tibble_3.0.3
[36] GenomeInfoDbData_1.2.4 interactiveDisplayBase_1.28.0 matrixStats_0.57.0 XML_3.99-0.5 crayon_1.3.4
[41] dplyr_1.0.1 later_1.1.0.1 GenomicAlignments_1.26.0 bitops_1.0-6 rappdirs_0.3.1
[46] grid_4.0.2 xtable_1.8-4 lifecycle_0.2.0 DBI_1.1.0 magrittr_1.5
[51] stringi_1.4.6 XVector_0.30.0 promises_1.1.1 xml2_1.3.2 ellipsis_0.3.1
[56] generics_0.0.2 vctrs_0.3.2 tools_4.0.2 bit64_4.0.5 glue_1.4.1
[61] purrr_0.3.4 BiocVersion_3.12.0 MatrixGenerics_1.2.0 hms_0.5.3 fastmap_1.0.1
[66] yaml_2.2.1 BiocManager_1.30.10 memoise_1.1.0
Fantastic! I should have looked in more detail for Ensembl resources.
Also I would like to make a point here that, trying the following initially gave me an error:
Installing
ensmbldb
resolved the issue: