Question: GenomicFeatures supportedUCSCFeatureDbTracks never returns and has to be terminated with C-c
gravatar for Malcolm Cook
12 months ago by
Malcolm Cook1.5k
United States
Malcolm Cook1.5k wrote:

In my hands, supportedUCSCFeatureDbTracks never returns and has to be terminated with C-c.

Any debugging help very welcome.

I've tried with hg19 in addition to dm6.  Below I report my linux setup, but it also fails under windows, R 3.5.1, GenomicFeatures_1.34.1

>  trks<-supportedUCSCFeatureDbTracks('dm6')
  C-c C-c
> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /n/apps/CentOS7/install/r-3.5.0/lib64/R/lib/
LAPACK: /n/apps/CentOS7/install/r-3.5.0/lib64/R/lib/

[1] C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] GenomicFeatures_1.32.0 AnnotationDbi_1.42.1   Biobase_2.40.0         GenomicRanges_1.32.6   GenomeInfoDb_1.16.0    IRanges_2.14.12        S4Vectors_0.18.3       BiocGenerics_0.26.0    devtools_1.13.6       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.19                compiler_3.5.0              XVector_0.20.0              prettyunits_1.0.2           bitops_1.0-6                tools_3.5.0                 zlibbioc_1.26.0             progress_1.2.0              biomaRt_2.36.1              digest_0.6.18               bit_1.1-14                  lattice_0.20-35             RSQLite_2.1.1               memoise_1.1.0               pkgconfig_2.0.2             rlang_0.3.0.1               Matrix_1.2-15               DelayedArray_0.6.6          DBI_1.0.0                   GenomeInfoDbData_1.1.0      rtracklayer_1.40.3         
[22] withr_2.1.2                 stringr_1.3.1               httr_1.3.1                  Biostrings_2.48.0           hms_0.4.2                   grid_3.5.0                  bit64_0.9-7                 R6_2.3.0                    BiocParallel_1.14.2         XML_3.98-1.12               blob_1.1.1                  magrittr_1.5                matrixStats_0.54.0          GenomicAlignments_1.16.0    Rsamtools_1.32.2            SummarizedExperiment_1.10.1 assertthat_0.2.0            stringi_1.2.4               RCurl_1.95-4.11             crayon_1.3.4               





genomicfeatures bug • 235 views
ADD COMMENTlink modified 12 months ago by James W. MacDonald52k • written 12 months ago by Malcolm Cook1.5k
Answer: GenomicFeatures supportedUCSCFeatureDbTracks never returns and has to be termina
gravatar for James W. MacDonald
12 months ago by
United States
James W. MacDonald52k wrote:

I think you are being too impatient. There are 35 tracks for dm6, and for each track you have to make a couple of queries to UCSC, which does take time. Even if I jam it out with a core for each query it takes a bit of time, mostly in transit (because the elapsed time is almost a minute, but the system time is not even two seconds):

> session <- browserSession()
> genome(session) <- "dm6"
> trx <- trackNames(ucscTableQuery(session))
> library(BiocParallel)
> system.time(bplapply(trx, GenomicFeatures:::isGoodTrack, session, BPPARAM = MulticoreParam(35)))
   user  system elapsed
  0.320   1.376  49.718

But waiting for that does look boring. Maybe it should be parallelized?

ADD COMMENTlink written 12 months ago by James W. MacDonald52k

Thanks, James, but I don't understand your response.  I commented on the function supportedUCSCFeatureDbTracks, which by my sights you did not exercise in your code, so, how could it address my issue?

ADD REPLYlink modified 12 months ago • written 12 months ago by Malcolm Cook1.5k
I ran the underlying code, which is normally run serially. I parallelized the slow step to show you that it does run. But since it's normally run serially, it will probably take about 35 minutes to run.
ADD REPLYlink written 12 months ago by James W. MacDonald52k

The code underlying supportedUCSCFeatureDbTracks is

> supportedUCSCFeatureDbTracks
function (genome) 
    session <- browserSession()
    genome(session) <- genome
    query <- ucscTableQuery(session)
    trx <- trackNames(ucscTableQuery(session))
    supported <- makeWhiteList(session, trx)

Which looks in part like what you are telling me is the underlying code, but not completely.

When I run the above line by line (adding namespace prefix as needed), I find this line is where things stall

    supported <- GenomicFeatures:::makeWhiteList(session, trx)

Which line is absent from what you are running.

... Ah, but, now I see, makeWhiteList in turn calls isGoodTrack.

Neither makeWhiteList nor isGoodTrack are exposed/documented so I don't know what to expect them to be doing.

It appears now to me that the line

    trx <- trackNames(ucscTableQuery(session))

tells me what I want to know.  Can you advise what further filtering for 'goodTracks' is supposed to provide me?


ADD REPLYlink modified 12 months ago • written 12 months ago by Malcolm Cook1.5k

I think we may be getting off point here. The main point I was trying to make, originally, was that the underlying code does work, and the fact that you were killing the process early had more to do with impatience on your part than some sort of error in the code.

To show that the underlying code does work, I had to expose some of the code and I parallelized it so I didn't have to wait for some amount of time =< 35 minutes in order to show that it does work, for some definition of 'work'.

I'm not sure it's critical for you to understand what isGoodTrack does, but I would imagine that it's not really that hard to infer? You already know that we are getting back a bunch of tracks from UCSC, and then a function called isGoodTrack generates something that can be used to subset that set of tracks to just those that can be used to create a FeatureDb. And the function name is isGoodTrack?

So, the main point here is that

  1. The code does work, and you are killing the process early because you think it's frozen when in fact it is diligently working.
  2. The diligent work that is being done is repeated 35 times, and each time takes about a minute, +/-, which ends up being pretty inefficient, and it may be useful to parallelize if we expect people to be able to use supportedUCSDFeatureDbTracks without getting bored and killing the process. The downside of doing that would be the requirement that GenomicFeatures would have to depend on BiocParallel, for this one function that until now I didn't know existed. So the tradeoff would be to make the GenomicFeatures package more complex to speed up a function that maybe people don't even use?
ADD REPLYlink written 12 months ago by James W. MacDonald52k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 147 users visited in the last hour