Question: Eliminating repetitive calls to ExperimentHub

Background: I am creating a new package, let's call it AnalysisPackage, to be submitted to Bioconductor. AnalysisPackage plots maps of the human brain colored by the enrichment or depletion of gene sets. The maps for each brain region are pretty large. In total, if I were to save all of this in sysdata.rda, the file would be ~25MB. This is far over the 4MB limit for Bioconductor packages, so I have generated an additional ExperimentData package, let's call it DataPackage, which I now access through the ExperimentHub() function.

The problem: There are multiple functions in AnalysisPackage that require data that's stored in DataPackage. There are also nested functions in AnalysisPackage. This means that every time the end user runs a function there are 6-12 repetitive calls to ExperimentHub. This adds a frustrating amount of time to each process.

The question: Is there a way to automatically load the data from DataPackage into memory when the user runs library(AnalysisPackage) so that I don't have to continuously interact with ExperimentHub? Alternatively, has anyone found any different solution around this type of problem? The only thing that I can come up with is passing the data from one function to another, though that would create an unnecessarily large data object for the end user to have to deal with. This doesn't seem like the optimal strategy.

written 4 months ago by slinker20

I read your question as trying to avoid the cost of reading the data from disk. One option is to 'memoize' data. A simple example is

> library(memoise)
> f = function(i) { Sys.sleep(i); i }
> fm = memoise(f)
> system.time(fm(1))
   user  system elapsed
  0.004   0.000   1.002
> system.time(fm(1))
   user  system elapsed
  0.024   0.000   0.028
> system.time(fm(2))
   user  system elapsed
  0.000   0.000   2.005
> system.time(fm(2))
   user  system elapsed
      0       0       0
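
To show where the speed-up on repeated calls comes from, here is a minimal base-R sketch of a memoising wrapper. It assumes a one-argument function whose argument can be turned into a character key. The name simple_memoise is hypothetical, and this is only an illustration of the idea, not how the package itself is implemented.

```r
# Hypothetical helper: wrap a one-argument function `f` so that each
# distinct argument is computed once; later calls return the stored value.
simple_memoise <- function(f) {
    results <- new.env(parent = emptyenv())   # private store, one per wrapper
    function(x) {
        key <- as.character(x)
        if (!exists(key, envir = results, inherits = FALSE)) {
            assign(key, f(x), envir = results)   # first call: compute and store
        }
        get(key, envir = results, inherits = FALSE)
    }
}

slow_identity <- function(i) { Sys.sleep(i); i }
fast_identity <- simple_memoise(slow_identity)

first  <- system.time(fast_identity(0.5))[["elapsed"]]  # pays for the sleep
second <- system.time(fast_identity(0.5))[["elapsed"]]  # served from the store
```

The store lives in the closure, so it lasts as long as the wrapped function does and is not shared between separate R processes.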

The idea would be to write an (internal) helper function such as

.hub <- ExperimentHub::ExperimentHub()

.helper <- memoise::memoise(function(ehid) {
    .hub[[ehid]]
})

This loads data on first use, so not all users would pay the cost of loading data. One would want to take additional precautions if this were to be used in a parallel evaluation context.

There are likely other approaches, e.g., using an .onLoad() function to load data:

.cache <- new.env(parent=emptyenv())

.onLoad <- function(libname, pkgname) {
    hub <- ExperimentHub::ExperimentHub()
    .cache[["EH123"]] <- hub[["EH123"]]
    # ...
}

And then reference .cache[["EH123"]] in your code.
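
As a sketch of how the nested functions from the question could share such a cache, the example below routes every lookup through one accessor, so the expensive load happens once per session rather than once per call. To keep it runnable without a hub, .load_resource is a hypothetical stand-in for hub[["EH123"]], describe_region and describe_all are made-up names for package functions, and the counter in .stats exists only to demonstrate the behaviour. Unlike the .onLoad() version, this variant fills the cache on first use.

```r
.cache <- new.env(parent = emptyenv())   # session-wide store for loaded data
.stats <- new.env(parent = emptyenv())   # demonstration only: counts loads
.stats$loads <- 0

# Hypothetical stand-in for the expensive lookup, i.e. hub[["EH123"]].
.load_resource <- function(ehid) {
    .stats$loads <- .stats$loads + 1
    list(id = ehid, regions = c("cortex", "cerebellum"))
}

# Single access point: load on first request, reuse afterwards.
.get_cached <- function(ehid) {
    if (!exists(ehid, envir = .cache, inherits = FALSE)) {
        assign(ehid, .load_resource(ehid), envir = .cache)
    }
    get(ehid, envir = .cache, inherits = FALSE)
}

# Inner function: needs the data itself.
describe_region <- function(region) {
    maps <- .get_cached("EH123")
    paste(region, "from", maps$id)
}

# Outer function: needs the data and calls the inner function repeatedly.
describe_all <- function() {
    maps <- .get_cached("EH123")
    vapply(maps$regions, describe_region, character(1))
}

descriptions <- describe_all()
```

Because the data never appears in a function signature, the end user does not have to pass a large object around.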

Questions about package development are better addressed to the bioc-devel mailing list.

memoize is super neat, thanks Martin.

Yes, memoize is super neat!