AnnotationHub/ExperimentHub resources backup
igor (@igor):

I always assumed that AnnotationHub/ExperimentHub data is stored in a single Bioconductor-managed location. ChatGPT assumes so as well, so there must be enough text out there suggesting it. However, as stated in Creating A Hub Package:

If you are not hosting the data on a stable web server (github and dropbox do not suffice), you should look into a stable option. We highly recommend zenodo; other options can include cloudflare, S3 buckets, Microsoft Azure Data Lake, or an institutional level server. ... When at all possible data should be hosted on a publicly accessible site designated by the package maintainer. ... In general, resources are only removed when they are no longer available (e.g., moved from web location, no longer provided etc.).

Although there is obviously a central AnnotationHub/ExperimentHub database hosted somewhere, the actual data is pulled from the original source, which could become temporarily/permanently unavailable. I think most of us have experienced the reliability of an "institutional level server" cited in older publications. More recent data storage repositories like Zenodo or Figshare should be fairly reliable, but until recently I would have thought NIH grants were guaranteed. Even if the repository is fine, specific datasets could be removed by accident.

Perhaps I am just misunderstanding how the system works. Should users not assume that AnnotationHub/ExperimentHub resources are perfectly reproducible in perpetuity? Should users have a personal backup of important resources?

Answer from @james-w-macdonald-5106:

You by definition have a personal backup of important resources: the first time you download something from AnnotationHub or ExperimentHub, it is saved in a cache directory on your local machine. If you need that resource later, it is read from the cache rather than downloaded again.
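For what it's worth, the cache location and the on-disk file behind any record can be inspected directly. A quick sketch (the record ID AH119325 is just the one used later in this thread):

```r
library(AnnotationHub)

hub <- AnnotationHub()

## Where the cache lives on this machine; it can be changed with
## setAnnotationHubOption("CACHE", "/some/other/dir")
hubCache(hub)

## Local file path backing a record (downloaded on first access,
## served from the cache thereafter)
fl <- cache(hub["AH119325"])
fl
```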

As far as something being perfectly reproducible in perpetuity, I am not sure that's a thing. Bioconductor itself exists because of NIH grants, and most of the resources we take for granted also exist due to NIH grants. If the eye of Sauron turns to point at Bioconductor, you can rest assured it will burn up just like any of the other things that got DOGEd this year.

Reply:
Yes, it is cached on your machine, but it wouldn't be cached for your colleagues or for your future self on a different machine.

I agree that "reproducible in perpetuity" may have been too bold a statement. What I really meant is that ExperimentHub is likely more reliable than some institutional-level server. Even if it goes away, at least someone will post about it.
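One low-tech way to cover colleagues and future machines is to copy the cached file itself to a shared location. A sketch, assuming the destination path is illustrative; cache() returns the local path for a record, so this works for any resource type, not just SQLite-backed ones:

```r
library(AnnotationHub)

hub <- AnnotationHub()

## Path of the cached file for a record of interest
fl <- cache(hub["AH119325"])

## Copy the raw file to a durable, shared location for colleagues
## or a future machine
file.copy(fl, "/shared/backups/AH119325_EnsDb_v113.sqlite")
```

How the copy is re-imported later depends on the resource type; for an EnsDb like this one, ensembldb::EnsDb() on the copied file works, as shown in the answer below.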

Reply:
You can always grab things from the cache if you want to save or share them, but that's not part of the paradigm, so there aren't direct facilities for it. It's not that difficult, though.

As an example, using an EnsDb object.

> library(AnnotationHub)
> hub <- AnnotationHub()
snapshotDate(): 2024-10-28
> ensdb <- hub[["AH119325"]]
loading from cache
require("ensembldb")
> ensdb
EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.3.10
|Creation time: Sat Oct 26 21:34:14 2024
|ensembl_version: 113
|ensembl_host: 127.0.0.1
|Organism: Homo sapiens
|taxonomy_id: 9606
|genome_build: GRCh38
|DBSCHEMAVERSION: 2.2
|common_name: human
|species: homo_sapiens
| No. of genes: 87726.
| No. of transcripts: 413674.
|Protein data available.

## Let's say we want to keep this Db to share
> library(RSQLite)
## copy the database to the working dir
> sqliteCopyDatabase(dbconn(ensdb), "iwannakeepthis.sqlite")
## load the saved Db 
> ensdb2 <- EnsDb("iwannakeepthis.sqlite")
> ensdb2
EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.3.10
|Creation time: Sat Oct 26 21:34:14 2024
|ensembl_version: 113
|ensembl_host: 127.0.0.1
|Organism: Homo sapiens
|taxonomy_id: 9606
|genome_build: GRCh38
|DBSCHEMAVERSION: 2.2
|common_name: human
|species: homo_sapiens
| No. of genes: 87726.
| No. of transcripts: 413674.
|Protein data available.

As far as perpetuity goes, many of these things are submitted by end users. They can ask for their data to be hosted on Azure, but that costs money (for Bioconductor), so there has to be some picking and choosing. In an ideal world all the stuff would be hosted somewhere safe, but we don't live in an ideal world, so we muddle along as best we can.

