For the Galaxy project we build quite a few containers that contain data packages (I hope I use the correct terms here) like `bioconductor-org.Hs.eg.db`. Often it's many such data packages, which leads to large container sizes. Since these libraries are essentially only data (plus a bit of boilerplate code?), I was wondering if it is possible (or a good idea) to provide such packages outside of the container in a central storage location. Before calling an R script one could `export R_LIBS=/central_library:$R_LIBS` and load the package.
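A minimal sketch of the same idea from inside R (equivalent to exporting `R_LIBS` before launching R); `/central_library` is just the hypothetical shared location from above:

```r
# Prepend the shared library location (hypothetical path) so it is
# searched before the container's own library directories.
.libPaths(c("/central_library", .libPaths()))

# Load the data package from the shared location.
library(org.Hs.eg.db)

# Confirm which directory the package was actually loaded from.
find.package("org.Hs.eg.db")
```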
My biggest concerns are:
- Do these data packages have additional (non-data, i.e. software) requirements which I would then need to install in the container?
- To my understanding, Bioconductor packages have "coupled" versions, i.e. for a given Bioconductor release there is exactly one combination of package versions that can be installed together. Would it be possible to accidentally load a wrong version from the central library, i.e. is version checking also done when loading a package, or only at install time? (See the sketch after this list.)
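For the version question, here is a hedged sketch of a check one could run after pointing `R_LIBS` at the central library. As far as I know, `library()` only enforces the R version and minimum dependency versions declared in `DESCRIPTION`, not membership in a particular Bioconductor release, but `BiocManager::valid()` can report packages that are out of sync with the release the R session is configured for:

```r
# Assumes BiocManager is installed inside the container.
library(BiocManager)

# The Bioconductor release this R session resolves to.
BiocManager::version()

# Compare all visible packages (including those picked up from the
# central library) against that release's repositories; returns TRUE
# when everything is consistent, otherwise a summary of mismatches.
BiocManager::valid()
```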
One additional question:
- Is there different data in the SQLite databases with each release, or might different releases actually contain the same data?
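One way to answer that empirically (a sketch, assuming two containers built against different releases of the package): checksum the SQLite file that backs the package in each container and compare the hashes.

```r
library(org.Hs.eg.db)

# Path to the SQLite file behind the data package.
db_path <- AnnotationDbi::dbfile(org.Hs.eg.db)

# Run this in each release's container; identical hashes would mean
# the releases ship identical data.
tools::md5sum(db_path)

# The package also records its build metadata (source dates etc.).
org.Hs.eg_dbInfo()
```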
An alternative would be to not use data packages but software packages, such as `ensembldb` or `biomaRt`, to connect to the data when needed. This might be less reproducible (although you could pin the release version or similar configuration to retrieve the same data at different timepoints), but it reduces the amount of storage needed, at the cost of depending on network connectivity and the availability of the remote resources; see the sketch below.
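A minimal sketch of the pinned-release idea with `biomaRt` (the Ensembl release number and the gene symbol are just illustrative choices, not anything prescribed):

```r
library(biomaRt)

# Pin a specific Ensembl archive release so repeated runs retrieve
# the same data (release 110 is an arbitrary example).
mart <- useEnsembl(biomart = "genes",
                   dataset = "hsapiens_gene_ensembl",
                   version = 110)

# Query on demand instead of shipping the data inside the container.
getBM(attributes = c("ensembl_gene_id", "hgnc_symbol", "chromosome_name"),
      filters = "hgnc_symbol",
      values = "TP53",
      mart = mart)
```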