I always assumed that AnnotationHub/ExperimentHub data is stored in a single Bioconductor-managed location. ChatGPT thinks so as well, so there is evidently enough text out there to suggest that. However, as stated in Creating A Hub Package:
If you are not hosting the data on a stable web server (GitHub and Dropbox do not suffice), you should look into a stable option. We highly recommend Zenodo; other options can include Cloudflare, S3 buckets, Microsoft Azure Data Lake, or an institutional level server. ... When at all possible data should be hosted on a publicly accessible site designated by the package maintainer. ... In general, resources are only removed when they are no longer available (e.g., moved from web location, no longer provided etc.).
Although there is obviously a central AnnotationHub/ExperimentHub database hosted somewhere, the actual data is pulled from the original source, which could become temporarily/permanently unavailable. I think most of us have experienced the reliability of an "institutional level server" cited in older publications. More recent data storage repositories like Zenodo or Figshare should be fairly reliable, but until recently I would have thought NIH grants were guaranteed. Even if the repository is fine, specific datasets could be removed by accident.
Perhaps I am just misunderstanding how the system works. Should users not assume that AnnotationHub/ExperimentHub resources are perfectly reproducible in perpetuity? Should users have a personal backup of important resources?
Yes, it is cached on your machine, but it wouldn't be cached for your colleagues or for your future self on a different machine.
I agree that "reproducible in perpetuity" may have been too bold of a statement. What I really meant is that ExperimentHub is likely more reliable than some institutional level server. Even if it goes away, at least someone will make a post about it.
You can always grab things from the cache if you want to save or share, but that's not part of the paradigm, so there aren't direct facilities to do so. But it's not that difficult either.
As an example, using an EnsDb object.

As far as perpetuity goes, many of these things are submitted by end users. They can ask for their data to be hosted on Azure, but that costs money (for Bioconductor), so there has to be some picking and choosing. In an ideal world all the stuff would be hosted somewhere safe, but we don't live in an ideal world, so we muddle along as best we can.
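Coming back to the point about grabbing things from the cache: here is a minimal sketch of how one might copy a cached hub resource, such as an EnsDb SQLite file, to a personal backup directory. It is not an official hub facility. The helper names `backup_file()` and `backup_hub_resource()`, the query terms, and the backup directory are all made up for illustration, and the sketch assumes that `AnnotationHub::cache()` returns the local file path(s) of the selected resource, downloading them first if needed.

```r
# Copy one file to a backup directory and check that the copy is identical.
# (Hypothetical helper, not part of AnnotationHub.)
backup_file <- function(cached_path, backup_dir) {
  dir.create(backup_dir, recursive = TRUE, showWarnings = FALSE)
  dest <- file.path(backup_dir, basename(cached_path))
  ok <- file.copy(cached_path, dest, overwrite = TRUE)
  if (!ok) stop("could not copy ", cached_path, " to ", backup_dir)
  # Compare MD5 checksums of the cached file and the backup copy.
  same <- identical(unname(tools::md5sum(cached_path)),
                    unname(tools::md5sum(dest)))
  if (!same) stop("checksum mismatch for ", dest)
  dest
}

# Find a resource in AnnotationHub, make sure it is cached, and back it up.
# (Hypothetical helper; needs the AnnotationHub package and network access.)
backup_hub_resource <- function(pattern, backup_dir) {
  hub <- AnnotationHub::AnnotationHub()
  hits <- AnnotationHub::query(hub, pattern)
  if (length(hits) == 0L) stop("no hub resources match the query")
  # Take the last hit only to keep the sketch short; in practice, pick the
  # AH identifier explicitly and write it down next to the backup.
  id <- names(hits)[length(hits)]
  # Assumption: cache() returns the local path(s) for the selected resource.
  cached_paths <- AnnotationHub::cache(hub[id])
  vapply(cached_paths, backup_file, character(1), backup_dir = backup_dir)
}

# Example usage (not run here, since it downloads data):
# paths <- backup_hub_resource(c("EnsDb", "Homo sapiens"), "hub_backup")
# edb <- ensembldb::EnsDb(paths[[1]])  # reopen the EnsDb from the backup copy
```

The backup keeps whatever file name the cache uses, so it is worth storing the AH identifier and the Bioconductor version alongside it; otherwise the file is hard to identify later or to hand to a colleague.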