Question

Use of data packages in containers

M ▴ 20

@893555ac

Germany

For the Galaxy project we build quite a few containers that include data packages (I hope I'm using the correct term here) such as bioconductor-org.Hs.eg.db. A container often includes many such data packages, which leads to large container sizes. Since these libraries are really only data (plus a bit of boilerplate code?), I was wondering whether it is possible (or a good idea) to provide such packages outside the container, in a central storage location. Before calling an R script, one could run export R_LIBS=/central_library:$R_LIBS and then load the package.
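The same idea can also be expressed from inside R rather than through the environment variable. The following is only a minimal sketch: the helper name use_central_library and the path /central_library are assumptions for illustration, not part of any existing tool.

```r
# Prepend a central library directory to R's library search path.
# Returns TRUE if the directory exists and was added, FALSE otherwise.
use_central_library <- function(path) {
  if (!dir.exists(path)) {
    return(invisible(FALSE))
  }
  # .libPaths() with an argument replaces the user part of the search path,
  # so the current paths are passed along to keep them available.
  .libPaths(c(path, .libPaths()))
  invisible(TRUE)
}

# Example call (hypothetical mount point of the shared data packages):
# use_central_library("/central_library")
# library(org.Hs.eg.db)
```

Because the central directory comes first in the search path, a data package found there would take precedence over a copy of the same package inside the container.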

My biggest concerns are:

Do these data packages have additional (non-data, i.e. software) requirements that I would need to install in the container?
To my understanding, Bioconductor libraries have "coupled" versions, i.e. there is exactly one combination of Bioconductor packages that can be installed. Would it be possible to load a wrong version, i.e. is version checking also done when a package is loaded?
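One way to look into the first concern is to read the DESCRIPTION metadata of an installed package, which lists its version and declared dependencies. This is a small sketch using base R's utils::packageDescription; the helper name pkg_requirements is made up for illustration.

```r
# Return the version and declared dependencies of an installed package,
# or NULL if the package cannot be found in the given library.
pkg_requirements <- function(pkg, lib.loc = NULL) {
  # packageDescription() returns NA (with a warning) for missing packages.
  desc <- suppressWarnings(utils::packageDescription(pkg, lib.loc = lib.loc))
  if (!is.list(desc)) {
    return(NULL)
  }
  list(
    version = desc$Version,
    depends = desc$Depends,  # NULL if the field is absent
    imports = desc$Imports   # NULL if the field is absent
  )
}

# Example call (assumes org.Hs.eg.db is installed in /central_library):
# pkg_requirements("org.Hs.eg.db", lib.loc = "/central_library")
```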

One additional question:

Is there different data in the SQLite databases with each release, or might different releases actually contain the same data?

org.Hs.eg.db • 425 views

updated 3 months ago by Lluís Revilla Sancho ▴ 730 • written 4 months ago by M ▴ 20

An alternative would be to use software packages such as ensembldb or biomaRt instead of data packages, and connect to the data when needed. This might be less reproducible (although you could pin the release version or similar settings to retrieve the same data at different time points), but it reduces the amount of storage needed, at the cost of depending on a network connection and on the availability of the remote resources.
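As a rough sketch of what pinning a release could look like with biomaRt: the function name fetch_ensembl_ids and the release number in the example call are assumptions for illustration, and the query needs network access to the Ensembl servers.

```r
# Look up Ensembl gene IDs for gene symbols against a fixed Ensembl release.
fetch_ensembl_ids <- function(symbols, release) {
  if (!requireNamespace("biomaRt", quietly = TRUE)) {
    stop("The biomaRt package is required for this sketch.")
  }
  # 'version' selects an archived Ensembl release instead of the current one.
  mart <- biomaRt::useEnsembl(
    biomart = "genes",
    dataset = "hsapiens_gene_ensembl",
    version = release
  )
  biomaRt::getBM(
    attributes = c("hgnc_symbol", "ensembl_gene_id"),
    filters = "hgnc_symbol",
    values = symbols,
    mart = mart
  )
}

# Example call (release number chosen arbitrarily):
# fetch_ensembl_ids(c("TP53", "BRCA1"), release = 110)
```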

3 months ago Lluís Revilla Sancho ▴ 730

score 0 · Answer 1 · 2023-12-19

James W. MacDonald 65k

@james-w-macdonald-5106

United States

The data in these packages does vary, particularly in the OrgDb packages, which are a snapshot of whatever NCBI had available to download at the time the packages were built. Since much of the data at NCBI is updated weekly, an OrgDb that is part of, say, Bioc-3.16 is likely to differ significantly from the version in Bioc-3.18. All of the packages have dependencies that provide the methods for querying. If you ensure that all possible dependencies are pre-installed in the container (from memory: AnnotationDbi, GenomicFeatures, BSgenome, maybe others?), then you don't need to install them alongside the data packages.

If you include BiocManager (a CRAN package) and run

library(BiocManager)
BiocManager::valid()

it will check the versions of the installed packages and report discrepancies. If you are using, say, Singularity or Docker containers, the packages are all binary and install quite quickly, so it might be easier to provide a startup script that checks for and installs whatever packages are required.
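A startup script along those lines might look like the following sketch. The helper names missing_pkgs and ensure_packages and the package list in the example call are assumptions; BiocManager::install is called with update = FALSE and ask = FALSE so that it runs non-interactively.

```r
# Return the subset of 'pkgs' that cannot be loaded from the current libraries.
missing_pkgs <- function(pkgs) {
  ok <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  unname(pkgs[!ok])
}

# Install whatever is missing via BiocManager, then report what was installed.
ensure_packages <- function(pkgs) {
  todo <- missing_pkgs(pkgs)
  if (length(todo) > 0) {
    if (!requireNamespace("BiocManager", quietly = TRUE)) {
      stop("BiocManager is required to install: ", paste(todo, collapse = ", "))
    }
    BiocManager::install(todo, update = FALSE, ask = FALSE)
  }
  invisible(todo)
}

# Example call at container startup (package list is illustrative):
# ensure_packages(c("AnnotationDbi", "org.Hs.eg.db"))
# BiocManager::valid()
```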

4 months ago James W. MacDonald 65k

Thanks for the reply.

So let's assume I have installed org.Hs.eg.db from Bioc-3.16.0 in a directory (without any of its requirements).

Can this be expected to run in any container that has the requirements R (>= 2.7.0), methods, and AnnotationDbi (>= 1.59.1) installed? In particular, I'm wondering whether I can also access the data with code from later Bioc releases, i.e. from newer containers.

4 months ago M ▴ 20

Currently the answer is yes. The underlying structure of the data packages hasn't (IIRC) materially changed in years, so you could grab an OrgDb from 2017 and install it and have it work with any version of Bioc between then and now.
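To illustrate, querying an OrgDb that lives in a separate library directory could be sketched as below. The function name query_orgdb and the directory in the example call are assumptions; the sketch relies on AnnotationDbi being installed in the container and on the data package exporting an object named after itself.

```r
# Attach org.Hs.eg.db from a given library directory and map gene symbols
# to Entrez gene IDs using the AnnotationDbi query interface.
query_orgdb <- function(symbols, lib = NULL) {
  # lib = NULL means "search the default library paths".
  library("org.Hs.eg.db", lib.loc = lib, character.only = TRUE)
  # The OrgDb object carries the same name as the package.
  db <- get("org.Hs.eg.db")
  AnnotationDbi::select(
    db,
    keys = symbols,
    keytype = "SYMBOL",
    columns = "ENTREZID"
  )
}

# Example call (hypothetical directory holding the older data package):
# query_orgdb(c("TP53", "BRCA1"), lib = "/central_library")
```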

But the reason the annotation packages are tightly tied to the Bioconductor version has to do with reproducibility. As an example, let's say I did an analysis for a collaborator in 2019, using the release version of all the packages. If they come back and want me to do an additional comparison, and I use the current release version, then the annotation of the genes may change substantially. NCBI (and Ensembl) update their annotations regularly, which can include all sorts of changes. One possible change is to the gene ID. If in the intervening period NCBI decided that gene 12345 is actually the same thing as gene 34342 and merged them, and gene 12345 was the top gene in the old analysis, my collaborator is going to be dismayed to find out that their top gene has now disappeared (because gene ID 12345 no longer exists in the new annotation file). To prevent that from happening, I would use the same version of R/Bioc for the updated analysis unless they wanted to update the annotations (in which case I would align to a new version of the genome that has the updated gene IDs).

The same issue can happen in reverse. If you have an old annotation package, and the person using the container has aligned to the most current version of the genome, there may be any number of mismatches between the IDs that the end user has and what is available in the annotation package.
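A quick check for this kind of mismatch is to compare the user's IDs against the keys of the annotation package before annotating. The following base R sketch uses a made-up helper name, unmatched_ids, and the gene IDs from the example above.

```r
# Return the user IDs that are absent from the annotation package's keys.
unmatched_ids <- function(user_ids, annotation_keys) {
  setdiff(unique(as.character(user_ids)), as.character(annotation_keys))
}

# Toy data: gene 12345 was merged into 34342 and no longer exists as a key.
unmatched_ids(c("12345", "34342"), c("34342", "7157"))

# With a real annotation package the keys could come from, for example:
# AnnotationDbi::keys(org.Hs.eg.db, keytype = "ENTREZID")
```

A non-empty result suggests that the annotation package and the user's data come from different annotation snapshots.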

4 months ago James W. MacDonald 65k