GenomicDataCommons request timeouts when cases() %>% ... %>% results_all()
mk ▴ 20
@mk-14473
Last seen 2.2 years ago
United States

This issue was already posted on Biostars; I am reporting it here for completeness.

https://www.biostars.org/p/359486/#359489

genomicdatacommons tcga gdc rest http • 1.1k views
@sean-davis-490
Last seen 3 months ago
United States

I should clean up the documentation, but results_all() is a convenience wrapper that is not too smart: it simply tries to return all results in one trip to the server. This can fail for several reasons related to the size of the result set. The better approach (and the only one for large result sets) is to page through the results:

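library(GenomicDataCommons)
library(dplyr)  # for bind_rows() and as_tibble()

proj <- 'TCGA-COAD'
# Namespace-qualify filter(), expand() and count() so that same-named
# functions from dplyr or tidyr do not mask the GenomicDataCommons versions.
query <- cases() %>%
    GenomicDataCommons::filter(~ project.project_id == proj) %>%
    GenomicDataCommons::expand('diagnoses')
n_cases <- query %>% GenomicDataCommons::count()  # total matching records
size <- 50  # records per request
# One request per page; each page is converted to a tibble.
reslist <- lapply(seq(1, n_cases, size), function(page) {
    query %>%
        results(size = size, from = page) %>%
        as_tibble()
})
case_data <- bind_rows(reslist)  # stack the pages into one table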

Unfortunately, the size parameter really requires trial-and-error to find the largest "working" setting since the results can vary quite significantly in volume. Instead, I usually just choose a smallish number like 50 or so and wait a few extra seconds. These calls can, in theory, be parallelized using something like BiocParallel to get really fancy (and introduce complexity).
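
The trial-and-error on size can be partly automated. What follows is only a minimal sketch, not something GenomicDataCommons provides: fetch_all_pages() and its arguments are hypothetical names, and fetch_page(from, size) is an assumed user-supplied function that returns one page as a data frame (for example, a wrapper around the results() call above). When a request fails, the sketch retries the same page with half the page size. The demo uses a simulated fetcher so it runs without network access.

```r
# Hypothetical helper (not part of GenomicDataCommons): pages through `total`
# records starting at index `start`, halving the page size after a failure.
fetch_all_pages <- function(fetch_page, total, size = 50, min_size = 1, start = 1) {
    pages <- list()
    from <- start
    last <- start + total - 1
    while (from <= last) {
        n <- min(size, last - from + 1)  # do not request past the last record
        page <- tryCatch(fetch_page(from, n), error = function(e) NULL)
        if (is.null(page)) {
            if (n <= min_size) {
                stop("page starting at ", from, " failed at the minimum page size")
            }
            size <- max(min_size, n %/% 2)  # retry the same page, half as large
            next
        }
        pages[[length(pages) + 1]] <- page
        from <- from + n
    }
    do.call(rbind, pages)
}

# With the query above, the fetcher could be (assumption, requires network):
# fetch_page <- function(from, size)
#     query %>% results(size = size, from = from) %>% as_tibble()

# Simulated fetcher: any request for more than 20 records "times out".
fake_fetch <- function(from, size) {
    if (size > 20) stop("simulated timeout")
    data.frame(id = seq(from, length.out = size))
}
res <- fetch_all_pages(fake_fetch, total = 95, size = 50)
```

The sketch combines pages with base rbind(), which expects identical columns on every page; bind_rows() is more forgiving if the columns differ between pages.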


Thanks @Sean Davis this is most helpful.
