GenomicScores cannot find phastCons30way.UCSC.hg38 though package is loaded
1
0
Entering edit mode
Paul Shannon ▴ 470
@paul-shannon-5944
Last seen 2.5 years ago
United States

This works fine:

gscores(phastCons100way.UCSC.hg38, GRanges(tbl.fimo[, c("chrom", "start", "end")]))
 GRanges object with 72 ranges and 1 metadata column:

as does this:

gscores(phastCons7way.UCSC.hg38, GRanges(tbl.fimo[, c("chrom", "start", "end")]))
 GRanges object with 72 ranges and 1 metadata column:

but this fails:

gscores(phastCons30way.UCSC.hg38, GRanges(tbl.fimo[, c("chrom", "start", "end")]))
 Error in gscores(phastCons30way.UCSC.hg38, GRanges(tbl.fimo[, c("chrom",  :
   object 'phastCons30way.UCSC.hg38' not found

I explicitly load all three phast files (using the 30way kindly provided by Robert Castelo a couple of months ago, and now also available in the AnnotationHub).`

sessionInfo() # with lots of information deleted, but the relevant information preserved

R Under development (unstable) (2020-01-29 r77745)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
...
[13] TrenaProjectErythropoiesis_1.0.4   TrenaProjectHG38_1.2.7
[15] TrenaMultiScore_1.0.19             MotifDb_1.29.6
[17] Biostrings_2.55.6                  XVector_0.27.1
[19] GenomicRanges_1.39.2               GenomeInfoDb_1.23.13
[21] IRanges_2.21.5                     S4Vectors_0.25.13
[23] BiocGenerics_0.33.2                TrenaProject_1.2.5

loaded via a namespace (and not attached):
  AnnotationHub_2.19.7
  phastCons100way.UCSC.hg38_3.7.1
  phastCons7way.UCSC.hg38_3.7.1
  phastCons30way.UCSC.hg38_3.11.1
  GenomicFeatures_1.39.7
  GenomicScores_1.11.5


annotation • 808 views
ADD COMMENT
1
Entering edit mode
Robert Castelo ★ 3.4k
@rcastelo
Last seen 2 days ago
Barcelona/Universitat Pompeu Fabra

hi Paul,

i think i know what's happening. one quick solution is to download the scores as AnnotationHub resources from your current R-devel BioC installation. this means doing:

library(GenomicScores)

phast <- getGScores("phastCons30way.UCSC.hg38")
snapshotDate(): 2020-02-28
download 333 resources? [y/n] y
  |======================================================================| 100%
[...]
  |======================================================================| 100%
phast
GScores object 
# organism: Homo sapiens (UCSC, hg38)
# provider: UCSC
# provider version: 03Nov2017
# download date: Jan 10, 2020
# loaded sequences: default
# maximum abs. error: 0.05
# use 'citation()' to cite these data in publications
gscores(phast, GRanges("chr7:117592326-117592330"))
GRanges object with 1 range and 1 metadata column:
      seqnames              ranges strand |   default
         <Rle>           <IRanges>  <Rle> | <numeric>
  [1]     chr7 117592326-117592330      * |      0.58
  -------
  seqinfo: 1 sequence from Genome Reference Consortium GRCh38 genome; no seqlengths

now, the problem is with generating a standard annotation package with the function makeGScoresPackage(), installing the package and then updating BioC packages. by default the generated package is going to have, as of March 2020, version 3.11.0, note that your phastCons30way.UCSC.hg38 package does not have version 3.11.0 but 3.11.1. this is because the BioC core team has decided that AnnotationHub resources must have a corresponding annotation package, not holding the data, but metadata, that is just having a manual page with the information about how the data was contributed and by whom. you can see that this package already exists in BioC devel here http://www.bioconductor.org/packages/devel/data/annotation/html/phastCons30way.UCSC.hg38.html. therefore, if you've self-created annotation package has version 3.11.0, when BioC is updated, it will be replaced by this metadata package.

there are situations in which one may want to have genomic scores stored in an standard annotation package, as for instance, when deploying its use in a cluster behind a firewall that precludes online access to the AnnotationHub, so this metadata package gets a bit in the way of that purpose because of having identical name. so, by now, if you want to use the annotation package, the only workaround i see is to increase the version of your annotation package to 3.11.2, re-build and install and it won't get updated when you update BioC packages.

cheers,

robert.

ADD COMMENT
0
Entering edit mode

Thanks, Robert. Once again you have bailed me, quickly and accurately. I'm grateful.

It takes about 4 minutes for this function call to complete: getGScores("phastCons30way.UCSC.hg38") This is true even on repeated calls, when the data is almost certainly in my AH cache.

However, after saving the object to disk as an RData file, loading takes less than a second.

This delay, along with the matter you discuss in your comment above, lead me to wonder if you see any virtue in offering an AH-independent option to GenomicScores. That is, have it do its work with conventionally library(pkg) loaded annotation packages?

ADD REPLY
0
Entering edit mode

indeed, in my computer also takes about 3 minutes to load the scores from the AH cache. i think the bottleneck is not in loading the object but in checking whether all associated AH resources are in the cache. my understanding is that this step involves checking through the internet which are these files of the associated AH resources and this means probably that the checking time is proportional to the number of files involved. resources associated with hg38 have a large number of files (> 300) because of chromosome patch and alternate sequences. in the case of phastCons30way.UCSC.hg38 they are 333. if i do the same test with, for instance, phastCons60way.UCSC.mm10, which has 59 files it takes in my computer 43 seconds, which is about 4 times less than phastCons30way.UCSC.hg38, which has about 5 times more files. probably, the 99.99% of users never query the alternate and patch sequences of hg38 so one workaround would be to exclude those sequences from being AH resources, i'll take a look at that. in the meantime, once you've downloaded the AH resources, just build the corresponding annotation package with:

makeGScoresPackage(phastobj, version="9.99.99", maintainer="you <you@example.com>", author="you")

where by setting version="9.99.99", you ensure that BiocManager::install() will not update and replace your annotation package with the corresponding metadata package.

as for why we are using AH for this, the reason is the same by which AH was created, which to my best understanding is to try to alleviate the number and weight of annotation packages, which puts a burden on the building systems at BioC.

ADD REPLY
0
Entering edit mode

Thanks, Robert. I will give that "build package" strategy a try. AH has always been a bit opaque to me; the background you offered is most helpful.

  • Paul
ADD REPLY

Login before adding your answer.

Traffic: 827 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6