This works fine:
gscores(phastCons100way.UCSC.hg38, GRanges(tbl.fimo[, c("chrom", "start", "end")]))
GRanges object with 72 ranges and 1 metadata column:
as does this:
gscores(phastCons7way.UCSC.hg38, GRanges(tbl.fimo[, c("chrom", "start", "end")]))
GRanges object with 72 ranges and 1 metadata column:
but this fails:
gscores(phastCons30way.UCSC.hg38, GRanges(tbl.fimo[, c("chrom", "start", "end")]))
Error in gscores(phastCons30way.UCSC.hg38, GRanges(tbl.fimo[, c("chrom", :
object 'phastCons30way.UCSC.hg38' not found
I explicitly load all three phast files (using the 30way kindly provided by Robert Castelo a couple of months ago, and now also available in the AnnotationHub).`
sessionInfo() # with lots of information deleted, but the relevant information preserved
R Under development (unstable) (2020-01-29 r77745)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
...
[13] TrenaProjectErythropoiesis_1.0.4 TrenaProjectHG38_1.2.7
[15] TrenaMultiScore_1.0.19 MotifDb_1.29.6
[17] Biostrings_2.55.6 XVector_0.27.1
[19] GenomicRanges_1.39.2 GenomeInfoDb_1.23.13
[21] IRanges_2.21.5 S4Vectors_0.25.13
[23] BiocGenerics_0.33.2 TrenaProject_1.2.5
loaded via a namespace (and not attached):
AnnotationHub_2.19.7
phastCons100way.UCSC.hg38_3.7.1
phastCons7way.UCSC.hg38_3.7.1
phastCons30way.UCSC.hg38_3.11.1
GenomicFeatures_1.39.7
GenomicScores_1.11.5
Thanks, Robert. Once again you have bailed me, quickly and accurately. I'm grateful.
It takes about 4 minutes for this function call to complete:
getGScores("phastCons30way.UCSC.hg38")
This is true even on repeated calls, when the data is almost certainly in my AH cache.However, after saving the object to disk as an RData file, loading takes less than a second.
This delay, along with the matter you discuss in your comment above, lead me to wonder if you see any virtue in offering an AH-independent option to GenomicScores. That is, have it do its work with conventionally
library(pkg)
loaded annotation packages?indeed, in my computer also takes about 3 minutes to load the scores from the AH cache. i think the bottleneck is not in loading the object but in checking whether all associated AH resources are in the cache. my understanding is that this step involves checking through the internet which are these files of the associated AH resources and this means probably that the checking time is proportional to the number of files involved. resources associated with hg38 have a large number of files (> 300) because of chromosome patch and alternate sequences. in the case of
phastCons30way.UCSC.hg38
they are 333. if i do the same test with, for instance,phastCons60way.UCSC.mm10
, which has 59 files it takes in my computer 43 seconds, which is about 4 times less thanphastCons30way.UCSC.hg38
, which has about 5 times more files. probably, the 99.99% of users never query the alternate and patch sequences of hg38 so one workaround would be to exclude those sequences from being AH resources, i'll take a look at that. in the meantime, once you've downloaded the AH resources, just build the corresponding annotation package with:where by setting
version="9.99.99"
, you ensure thatBiocManager::install()
will not update and replace your annotation package with the corresponding metadata package.as for why we are using AH for this, the reason is the same by which AH was created, which to my best understanding is to try to alleviate the number and weight of annotation packages, which puts a burden on the building systems at BioC.
Thanks, Robert. I will give that "build package" strategy a try. AH has always been a bit opaque to me; the background you offered is most helpful.