UniProt.ws - Speeding up select() for data retrieval
JavierM • 0

Hi,

I am using the UniProt.ws package to retrieve whole proteomes, giving the taxon ID of an organism as input. I get the IDs with keys() and then use select() to retrieve the information I need for each ID. The problem is that when I try to download the human proteome, the select() step is very time-consuming. For example:


library(UniProt.ws)
up <- UniProt.ws(9606)
egs = keys(up, "UNIPROTKB")
length(egs)
[1] 192814
# Almost 200k UniProtKB IDs map to Homo sapiens

# The message I get when I run the select() line is:
res <- select(up, keys = egs, columns = c("ORGANISM","UNIPROTKB", "REVIEWED", "LENGTH" ,"SEQUENCE"), keytype = "UNIPROTKB")
Uniprot limits queries with a large amount of keys. It's recommended that the select method be invoked with fewer than 100 keys or the query may fail.
Getting extra data for Q8N7X0, Q5T1N1, Q92667... (400 total)
Getting extra data for Q14094, Q8TBY9, Q8WUH1... (400 total)
Getting extra data for B4DZS4, Q9Y4R8, A0A087X1G2... (400 total)
Getting extra data for Q4KMQ1, Q12815, O94811... (400 total)
Getting extra data for A4D0V7, Q14894, Q13324... (400 total)
Getting extra data for O14618, Q8IZV2, O76039... (400 total)
Getting extra data for P29966, Q8IVH8, A1Z1Q3... (400 total)
Getting extra data for A0A0C4DH26, Q9BQK8, A0A0B4J2D9... (400 total)

Timing stopped at: 1.107 0.109 74.67

# I stopped it after the first minute


Given that about 192k UniProtKB IDs are associated with Homo sapiens and it takes roughly 1 minute to download ~3k of them, retrieving all the entries takes more than 1 hour. Is there a way to speed up this process? For species with many thousands of entries, it takes a long time to retrieve them all, and I intend to do this often for many organisms.
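
As a rough check of that estimate, the arithmetic can be written out in R. This assumes that all eight 400-ID batches shown in the log completed within the 74.67 s of elapsed time, and that the rate stays constant for the whole run; both are assumptions, not measurements.

```r
# Figures taken from the console output above
n_ids          <- 192814   # length(egs)
ids_per_batch  <- 400      # "(400 total)" in each progress message
batches_seen   <- 8        # progress messages printed before stopping
elapsed_sec    <- 74.67    # third value of "Timing stopped at"

# Assumed constant throughput, in IDs per second
rate <- (ids_per_batch * batches_seen) / elapsed_sec

# Projected time for the full key set, in minutes
est_minutes <- n_ids / rate / 60
est_minutes
```

Under those assumptions the projection lands somewhat above one hour, which is consistent with the estimate in the post.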

Thanks a lot,

Javier

@james-w-macdonald-5106

The problem with UniProt.ws is that you can only request results for a limited number of identifiers at a time, and repeated requests within a very short period are frowned upon and can get your IP banned (repeated requests in a tight for loop are how you do a denial of service attack). So under the hood, subsets of your IDs are sent off in repeated requests, separated by enough time that you won't get banned, which is obviously going to take an inordinate amount of time.
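
To make the idea concrete, here is a minimal sketch of chunked requests with a pause between them. It is not the actual UniProt.ws internals: fetch_in_chunks(), its chunk_size and pause arguments, and the stand-in fake_fetch() are all hypothetical names chosen for illustration, and the demonstration runs offline.

```r
# Hypothetical helper: split the keys into chunks and pause between requests.
# `fetch_fun` stands in for whatever performs a single request, for example
#   function(k) select(up, keys = k, columns = cols, keytype = "UNIPROTKB")
fetch_in_chunks <- function(keys, fetch_fun, chunk_size = 100, pause = 1) {
  # Assign each key to a chunk number: 1, 1, ..., 2, 2, ...
  chunks <- split(keys, ceiling(seq_along(keys) / chunk_size))
  results <- vector("list", length(chunks))
  for (i in seq_along(chunks)) {
    results[[i]] <- fetch_fun(chunks[[i]])
    # Wait before the next request, but not after the last one
    if (i < length(chunks)) Sys.sleep(pause)
  }
  do.call(rbind, results)
}

# Offline demonstration with a stand-in fetch function (no network access)
fake_fetch <- function(k) {
  data.frame(UNIPROTKB = k, N_CHAR = nchar(k), stringsAsFactors = FALSE)
}
demo_keys <- sprintf("ID%04d", 1:250)
res <- fetch_in_chunks(demo_keys, fake_fetch, chunk_size = 100, pause = 0)
nrow(res)
```

The pause is what keeps the requests polite, and it is also why the total time grows with the number of chunks no matter how fast each individual request is.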

It's been a while since I looked at the data availability at uniprot.org, but for bulk requests, downloading all the data from them and then parsing it locally is probably the way to go. You would have to investigate how to do that yourself.
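
As a starting point for the parsing step, here is a minimal base R sketch that turns FASTA text with UniProt-style headers into a data.frame. It assumes you have already obtained a proteome FASTA file somehow; where and how to download it is not covered here. parse_uniprot_fasta() is a hypothetical helper, the records below are made-up toy entries rather than real proteins, and REVIEWED is returned as a logical, which may differ from what select() gives you.

```r
# Toy FASTA text in the UniProt header style (made-up records).
# For a real file: fasta_lines <- readLines("proteome.fasta")  # hypothetical path
fasta_lines <- c(
  ">sp|X00001|TOY1_HUMAN Toy protein one OS=Homo sapiens OX=9606",
  "MKTAYIAKQR",
  "QISFVKSHFS",
  ">tr|X00002|X00002_HUMAN Toy protein two OS=Homo sapiens OX=9606",
  "MADEEKLPPG"
)

parse_uniprot_fasta <- function(lines) {
  lines <- lines[nzchar(lines)]          # drop empty lines
  is_header <- startsWith(lines, ">")
  headers <- lines[is_header]
  record <- cumsum(is_header)            # record index for every line
  # Group sequence lines by record; records without sequence lines stay empty
  groups <- split(lines[!is_header],
                  factor(record[!is_header], levels = seq_along(headers)))
  seqs <- unname(vapply(groups, paste, character(1), collapse = ""))
  # First header token looks like "sp|ACCESSION|ENTRY_NAME"
  first_token <- sub("^>", "", sub(" .*$", "", headers))
  fields <- strsplit(first_token, "|", fixed = TRUE)
  data.frame(
    UNIPROTKB = vapply(fields, `[`, character(1), 2),
    # "sp" marks reviewed (Swiss-Prot) entries, "tr" unreviewed (TrEMBL) ones
    REVIEWED  = vapply(fields, `[`, character(1), 1) == "sp",
    LENGTH    = nchar(seqs),
    SEQUENCE  = seqs,
    stringsAsFactors = FALSE
  )
}

proteome <- parse_uniprot_fasta(fasta_lines)
proteome[, c("UNIPROTKB", "REVIEWED", "LENGTH")]
```

Parsing a local file like this involves no per-request waiting, so the run time depends only on file size. For large files, a dedicated reader such as readAAStringSet() from Biostrings may be a better fit than this base R version.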