Very slow select to extract data from UniProt.ws object
alxcho ▴ 10
@alxcho-16189
Last seen 3.3 years ago
Hi,

I’ve been using UniProt.ws for a while now and never really had issues with it until now. When I try to extract data using select, it works and retrieves the data correctly, but it is now so slow that it's impossible to use. It’s never been really fast, but it was more than fast enough for several hundred or thousand keys. Currently, a quick test with microbenchmark shows that select takes about 5/12/16 s to extract 8 columns for 1/8/15 keys. I ran the test with so few keys because it just hangs for larger numbers. I tried updating all packages, but I didn't see any change. What am I missing? Thanks!


Here’s a quick version of the code I used to test:

library("microbenchmark")
library(UniProt.ws)

UniprotDB <- UniProt.ws(10090)
For_uniprot_Mapping <- c("Q3TX55", "S4R267","A2A9P6", "D3YXA2", "Q80TM9", "Q6RT24", "Q5RJH6", "Q8JZM", "Q00519", "Q9D0B8","Q9WUM5","Q8R3E3", "R4GML0","E9PV80","Q8BJF9")
Uniprot_search1 <- c("PROTEIN-NAMES", "GENES","ENTRY-NAME", "UNIGENE", "ENSEMBL", "ENTREZ_GENE", "LENGTH", "SCORE")

microbenchmark(
  select(UniprotDB, keys = For_uniprot_Mapping[1:2], columns = Uniprot_search1, keytype = "UNIPROTKB"),
  select(UniprotDB, keys = For_uniprot_Mapping[1:8], columns = Uniprot_search1, keytype = "UNIPROTKB"),
  select(UniprotDB, keys = For_uniprot_Mapping[1:15], columns = Uniprot_search1, keytype = "UNIPROTKB"),
  times = 3)

The session info is as follows:

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

 

attached base packages:
[1] tcltk     parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] microbenchmark_1.4-4  splitstackshape_1.4.4 UniProt.ws_2.20.0     BiocGenerics_0.26.0   RCurl_1.95-4.10     
[6] bitops_1.0-6          RSQLite_2.1.1        

loaded via a namespace (and not attached):
[1] Rcpp_0.12.17         plyr_1.8.4           compiler_3.5.0       pillar_1.2.3         dbplyr_1.2.1       
[6] bindr_0.1.1          tools_3.5.0          digest_0.6.15        bit_1.1-14           gtable_0.2.0       
[11] BiocFileCache_1.4.0  memoise_1.1.0        tibble_1.4.2         pkgconfig_2.0.1      rlang_0.2.1        
[16] DBI_1.0.0            bindrcpp_0.2.2       dplyr_0.7.5          httr_1.3.1           S4Vectors_0.18.3   
[21] IRanges_2.14.10      rappdirs_0.3.1       grid_3.5.0           stats4_3.5.0         bit64_0.9-7        
[26] tidyselect_0.2.4     glue_1.2.0           Biobase_2.40.0       data.table_1.11.4    R6_2.2.2           
[31] AnnotationDbi_1.42.1 ggplot2_2.2.1        purrr_0.2.5          blob_1.1.1           magrittr_1.5       
[36] scales_0.5.0         assertthat_0.2.0     colorspace_1.3-2     lazyeval_0.2.1       munsell_0.5.0
  

BiocInstaller::biocValid()
[1] TRUE
 
uniprot.ws select • 1.4k views
@martin-morgan-1513
Last seen 5 days ago
United States

I think the problem is on the UniProt end, and perhaps you could reach out to them to ask. For instance, you can see the URL being queried with

trace(RCurl::getForm, quote(print(uri)), exit=TRUE)
trace(read.delim, quote(print(file)), exit=TRUE)
select(UniprotDB, keys = For_uniprot_Mapping[1:2], columns = Uniprot_search1, keytype = "UNIPROTKB")

which gives in part

> select(UniprotDB, keys = For_uniprot_Mapping[1:2], columns = Uniprot_search1, keytype = "UNIPROTKB")
Getting mapping data for Q3TX55 ... and UNIGENE_ID
Tracing getForm(url, .params = params, .opts = list(FOLLOWLOCATION = TRUE)) on entry 
[1] "https://www.uniprot.org/mapping/"
Tracing getForm(url, .params = params, .opts = list(FOLLOWLOCATION = TRUE)) on exit 
Getting mapping data for Q3TX55 ... and P_ENTREZGENEID
Tracing getForm(url, .params = params, .opts = list(FOLLOWLOCATION = TRUE)) on entry 
[1] "https://www.uniprot.org/mapping/"
Tracing getForm(url, .params = params, .opts = list(FOLLOWLOCATION = TRUE)) on exit 
Getting mapping data for Q3TX55 ... and ENSEMBL_ID
Tracing getForm(url, .params = params, .opts = list(FOLLOWLOCATION = TRUE)) on entry 
[1] "https://www.uniprot.org/mapping/"
Tracing getForm(url, .params = params, .opts = list(FOLLOWLOCATION = TRUE)) on exit 
Getting extra data for Q3TX55, S4R267
Tracing read.delim(URLencode(url), stringsAsFactors = FALSE) on entry 
[1] "https://www.uniprot.org/uniprot/?query=Q3TX55+or+S4R267&format=tab&columns=id,entry%20name,genes,length,protein%20names,annotation%20score"
Tracing read.delim(URLencode(url), stringsAsFactors = FALSE) on exit

Each 'on entry / on exit' pair represents time spent waiting for a response; copying the URLs into a web browser will be just as slow. The question for the UniProt people is why the URL is slow when entered into a web browser. (The 'https' above may be 'http' in the version that you're using; look for an update via biocLite() in the next several days that will use https.)
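To check where the time goes without going through select(), a single request can be timed on its own. What follows is only a minimal sketch: `time_request` is a hypothetical helper (not part of UniProt.ws or RCurl), and `fetch` stands for whatever function performs the download, for example read.delim on the URL printed by the trace above.

```r
# Hypothetical helper, not part of UniProt.ws: run `fetch(...)` once and
# return the elapsed wall-clock seconds together with the fetched result.
time_request <- function(fetch, ...) {
  start <- proc.time()[["elapsed"]]
  result <- fetch(...)
  elapsed <- proc.time()[["elapsed"]] - start
  list(elapsed = elapsed, result = result)
}

# Example usage (requires network access; URL copied from the trace output):
# url <- "https://www.uniprot.org/uniprot/?query=Q3TX55+or+S4R267&format=tab&columns=id,entry%20name,genes,length,protein%20names,annotation%20score"
# timing <- time_request(read.delim, url, stringsAsFactors = FALSE)
# timing$elapsed
```

If the elapsed time here is about the same as the time select() takes, the wait is in the request itself rather than in the package code.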


Alright then. In that case, I'll wait for UniProt to finish their transition to https and see whether things get back to normal after the update(s). That would make total sense. I'll check with them afterwards if the issue remains. Thanks a lot!


I’ve done some more tests, and from what I can see:

  • The UniProt web page seems fine. It has no trouble handling the same full list of 1000 entries; submitting it by hand through the website takes less than 5 s to get the mapping.
  • If I split the 1000 entries into chunks (say, mapping the whole list 20 entries at a time), that seems to overcome the issue. It’s slow, but at least it can complete the task (partly).
  • I am not sure UniProt is the issue: the UniProt.ws package maps 1000 keys very quickly as long as the columns passed to select are "UNIGENE", "ENSEMBL", "ENTREZ_GENE". The console prints the message “Getting mapping data for…”, and all is good and very quick.
  • The issue arises when the requested columns are something else, like "PROTEIN-NAMES", "GENES", "ENTRY-NAME". In this case, the console prints “Getting extra data for…”, and mapping the same number of entries is much slower, or even hangs for a rather small number of entries (chunks of 20 work, but 100 do not). I'm not sure how differently select handles the UniProt.ws object depending on the type of columns specified, but I reckon that is where the issue lies.
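The chunking workaround described above could be sketched roughly as follows. This is an illustration under assumptions, not the code actually used: `select_in_chunks` is a hypothetical helper (not part of UniProt.ws), the chunk size of 20 is simply the value that happened to work in the tests above, and `fetch` is any function that takes a character vector of keys and returns a data.frame.

```r
# Hypothetical helper, not part of UniProt.ws: split `keys` into chunks of at
# most `chunk_size`, call `fetch` on each chunk, and row-bind the results.
select_in_chunks <- function(keys, fetch, chunk_size = 20) {
  if (length(keys) == 0) return(data.frame())
  # Group index 1,1,...,2,2,... so each group holds up to chunk_size keys
  groups <- ceiling(seq_along(keys) / chunk_size)
  chunks <- split(keys, groups)
  # One call to `fetch` per chunk, in the original key order
  results <- lapply(chunks, fetch)
  combined <- do.call(rbind, unname(results))
  rownames(combined) <- NULL
  combined
}

# Example usage with the objects from the question (requires network access):
# res <- select_in_chunks(
#   For_uniprot_Mapping,
#   function(k) select(UniprotDB, keys = k, columns = Uniprot_search1,
#                      keytype = "UNIPROTKB")
# )
```

Passing the query as a function keeps the helper independent of select(), so the same loop can be tried with only the mapping columns or only the "extra data" columns to compare the two.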