Ensembl archive challenges
yunadal • 0
Last seen 8 months ago

Hello all,

I am having a lot of trouble accessing biomaRt / Ensembl at the moment.

I have GRCh38 chr:pos SNPs that I would like the dbSNP151 rsIDs for, and to this end have been trying to poll version 96 of Ensembl.

This has worked once earlier today for 10 SNPs: 7:128935744:128935744, 3:119518880:119518880, 12:132463597:132463597, 4:40305571:40305571, 17:45379362:45379362, 6:32143247:32143247, 2:191084261:191084261, 6:32286483:32286483, 6:31331721:31331721, 6:32714358:32714358,

Subsequently, I have been unable to change the attributes I request, as the query returns:

Error: biomaRt has encountered an unexpected server error.
Consider trying one of the Ensembl mirrors (for more details look at ?useEnsembl)

I cannot use a different mirror, as it is an archived version of Ensembl.

In an attempt to debug, I have moved to the current release of Ensembl, and am now getting:

Error in curl::curl_fetch_memory(url, handle = handle) : 
  Timeout was reached: [asia.ensembl.org:443] Operation timed out after 300004 milliseconds with 0 bytes received

despite trying different mirrors (www, asia, useast)

The current pipeline for GRCh38 SNPs returning dbSNP154 rsIDs is:


library(gwasrapidd)  # get_variants()
library(biomaRt)     # useEnsembl(), getBM()

# GWAS Catalog variants for the trait EFO_0002690
SLE <- get_variants(efo_id = "EFO_0002690")

# Build chr:start:end region strings (start == end) for the first 10 variants
SLEsnps <- c(paste(SLE@variants[1,4][[1]], SLE@variants[1,5][[1]], SLE@variants[1,5][[1]], sep = ":"))
for (i in 2:10){
  SLEsnps <- append(SLEsnps, paste(SLE@variants[i,4][[1]], SLE@variants[i,5][[1]], SLE@variants[i,5][[1]], sep = ":"))
}

ensembl <- useEnsembl(biomart = 'snps', dataset = 'hsapiens_snp', mirror = "www")
getBM(attributes = c("refsnp_id"), #stable code
       filters = c("chromosomal_region"),
       values = list(SLEsnps), 
       mart = ensembl)

Obviously for dbSNP151 rsIDs I would use ensembl <- useEnsembl(biomart = 'snps', dataset = 'hsapiens_snp', version = 96)
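
The region strings in the pipeline above follow a chr:start:end pattern, with start equal to end for a single-base SNP. As a minimal sketch, the same strings can be built without a loop. The helper name make_regions is hypothetical, and it assumes you have already pulled the chromosome names and positions out into two plain vectors.

```r
# Hypothetical helper: build "chr:pos:pos" strings for single-base SNPs.
# `chr` and `pos` are assumed to be equal-length vectors.
make_regions <- function(chr, pos) {
  # format() avoids scientific notation for round positions such as 1e8
  pos_chr <- format(as.numeric(pos), scientific = FALSE, trim = TRUE)
  paste(as.character(chr), pos_chr, pos_chr, sep = ":")
}

# Example with the first two SNPs listed earlier in this post
make_regions(c("7", "3"), c(128935744, 119518880))
```

The result can be passed to getBM() as values = list(...) in the same way as SLEsnps.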

What on earth am I doing wrong to have recurrent time-outs and internal server errors?

sessionInfo( )

R version 4.0.4 (2021-02-15)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods  
[9] base     

other attached packages:
 [1] BSgenome_1.58.0      rtracklayer_1.50.0   Biostrings_2.58.0   
 [4] XVector_0.30.0       GenomicRanges_1.42.0 GenomeInfoDb_1.26.7 
 [7] IRanges_2.24.1       S4Vectors_0.28.1     BiocGenerics_0.36.1 
[10] Matrix_1.3-4         biomaRt_2.46.3       gwasrapidd_0.99.11  

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7                  lattice_0.20-44            
 [3] prettyunits_1.1.1           Rsamtools_2.6.0            
 [5] assertthat_0.2.1            utf8_1.2.2                 
 [7] BiocFileCache_1.14.0        R6_2.5.1                   
 [9] RSQLite_2.2.8               httr_1.4.2                 
[11] pillar_1.6.2                zlibbioc_1.36.0            
[13] rlang_0.4.11                progress_1.2.2             
[15] curl_4.3.2                  rstudioapi_0.13            
[17] blob_1.2.2                  BiocParallel_1.24.1        
[19] stringr_1.4.0               RCurl_1.98-1.4             
[21] bit_4.0.4                   tinytex_0.33               
[23] DelayedArray_0.16.3         compiler_4.0.4             
[25] xfun_0.25                   pkgconfig_2.0.3            
[27] askpass_1.1                 SummarizedExperiment_1.20.0
[29] openssl_1.4.5               tidyselect_1.1.1           
[31] tibble_3.1.4                GenomeInfoDbData_1.2.4     
[33] matrixStats_0.60.1          XML_3.99-0.7               
[35] fansi_0.5.0                 withr_2.4.2                
[37] crayon_1.4.1                dplyr_1.0.7                
[39] dbplyr_2.1.1                GenomicAlignments_1.26.0   
[41] bitops_1.0-7                rappdirs_0.3.3             
[43] grid_4.0.4                  lifecycle_1.0.0            
[45] DBI_1.1.1                   magrittr_2.0.1             
[47] cli_3.0.1                   stringi_1.7.4              
[49] cachem_1.0.6                xml2_1.3.2                 
[51] ellipsis_0.3.2              generics_0.1.0             
[53] vctrs_0.3.8                 tools_4.0.4                
[55] bit64_4.0.5                 Biobase_2.50.0             
[57] glue_1.4.2                  purrr_0.3.4                
[59] MatrixGenerics_1.2.1        hms_1.1.0                  
[61] fastmap_1.1.0               AnnotationDbi_1.52.0       
[63] BiocManager_1.30.16         memoise_2.0.0
biomaRt
Hey team,

Brief update --

I think the errors I am having are related to Ensembl (?) server load and time-outs.

By limiting the request to 5 SNPs I get reliable responses on Ensembl 104 (most recent build), with no time-outs.

Unfortunately, using archived versions is still a bit flaky. For example, useEnsembl(biomart = 'snps', dataset = 'hsapiens_snp', version = 96) throws an internal server error after trying several servers. useEnsembl(biomart = 'snps', dataset = 'hsapiens_snp', version = 95) connects fine, but a getBM() query for 5 SNPs then throws Error: biomaRt has encountered an unexpected server error. The same query works fine for a single SNP.

All a little odd -- is biomaRt usually this limited in its throughput?

Mike Smith ★ 5.5k
Last seen 3 hours ago
EMBL Heidelberg / de.NBI

Your query looks fine, and I'm afraid I don't think there's much you can actually do to make this work faster when you're querying the most recent Ensembl build. Ensembl BioMart is a complicated tool, and it's not easy to predict performance. It looks to me like this particular combination of filters, attributes and the human SNP dataset is very slow to query. You see the same slowness if you run the query in a web browser rather than via biomaRt. I wish I could give you an explanation of why it is so slow, but BioMart is pretty opaque regarding the operations it's carrying out in the background.

Normally I'd advise that running a single query with multiple values is more efficient than lots of small queries. However, if you hit BioMart's 5-minute time limit you get nothing back, and it seems like that happens for even a very small number of query values. As you've figured out, it's probably most reliable to run individual queries, but I expect it will still be painfully slow.
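
If it helps, here is a rough sketch of the "many small queries" approach, where one failed or timed-out batch doesn't throw away the rest. The helper names (split_batches, query_in_batches, biomart_query) are made up for illustration and are not part of biomaRt; the getBM() call just mirrors the one in your post.

```r
# Split a vector of region strings into chunks of at most `size` values.
split_batches <- function(x, size = 1L) {
  unname(split(x, ceiling(seq_along(x) / size)))
}

# Run `query_fun` on each batch. A batch that errors (e.g. a time-out)
# is recorded in `failed` instead of aborting the whole run.
query_in_batches <- function(regions, query_fun, size = 1L, pause = 0) {
  results <- list()
  failed <- character(0)
  for (batch in split_batches(regions, size)) {
    res <- tryCatch(query_fun(batch), error = function(e) NULL)
    if (is.null(res)) {
      failed <- c(failed, batch)
    } else {
      results[[length(results) + 1L]] <- res
    }
    Sys.sleep(pause)  # optional gap between requests
  }
  list(results = do.call(rbind, results), failed = failed)
}

# Wrapper around the query from the original post (needs biomaRt and a mart).
biomart_query <- function(batch, mart) {
  biomaRt::getBM(attributes = c("refsnp_id"),
                 filters = c("chromosomal_region"),
                 values = list(batch),
                 mart = mart)
}

# Usage (not run here, it needs a live connection):
# ensembl <- biomaRt::useEnsembl(biomart = "snps", dataset = "hsapiens_snp")
# out <- query_in_batches(SLEsnps, function(b) biomart_query(b, ensembl))
# out$failed  # regions to retry later
```

With size = 1 this is one query per SNP, which is slow but means a single time-out only costs you that one SNP.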

Regarding the "Internal Error 500" when trying version 96, I get the same problem when visiting that archive page (http://apr2019.archive.ensembl.org/index.html) in a browser. It seems like that entire Ensembl archive is offline at the moment, so you won't be able to connect to the relevant BioMart regardless of the query you want to run.
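
One way to tell "the archive is down" apart from "my query is too slow" is to check that the archive site responds before connecting to it. This is only a sketch: archive_is_up and its fetch_status argument are hypothetical names, and the default check assumes the httr package is installed.

```r
# Return TRUE if `url` answers with an HTTP 2xx status, FALSE otherwise.
# `fetch_status` is a hypothetical hook so the check can be swapped out;
# by default it makes a real request with httr and a 30 second time-out.
archive_is_up <- function(url,
                          fetch_status = function(u) {
                            httr::status_code(httr::GET(u, httr::timeout(30)))
                          }) {
  # Connection errors and time-outs are treated as "not up"
  status <- tryCatch(fetch_status(url), error = function(e) NA_integer_)
  !is.na(status) && status >= 200 && status < 300
}

# Usage (not run here, it needs a live connection):
# if (archive_is_up("http://apr2019.archive.ensembl.org/index.html")) {
#   ensembl96 <- biomaRt::useEnsembl(biomart = "snps",
#                                    dataset = "hsapiens_snp",
#                                    version = 96)
# }
```

A 500 response from the archive page would return FALSE here, which matches what I see in the browser.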


