Question: Eliminating repetitive calls to ExperimentHub
8 weeks ago, slinker wrote:

Background: I am creating a new package, let's call it AnalysisPackage, to be submitted to Bioconductor. AnalysisPackage plots maps of the human brain colored by the enrichment or depletion of gene sets. The maps for each brain region are pretty large. In total, if I were to save all of this in sysdata.rda, the file would be ~25MB. This is far over the 4MB limit for Bioconductor packages, so I have generated an additional ExperimentData package, let's call it DataPackage, which I now call with the ExperimentHub() function.

The problem: There are multiple functions in AnalysisPackage that require data that's stored in DataPackage. There are also nested functions in AnalysisPackage. This means that every time the end user runs a function there are 6-12 repetitive calls to ExperimentHub. This adds a frustrating amount of time to each process.

The question: Is there a way to automatically load the data from DataPackage into memory when the user runs library(AnalysisPackage) so that I don't have to continuously interact with ExperimentHub? Alternatively, has anyone found any different solution around this type of problem? The only thing that I can come up with is passing the data from one function to another, though that would create an unnecessarily large data object for the end user to have to deal with. This doesn't seem like the optimal strategy.

Thanks in advance- Sara






I read your question as trying to avoid the cost of reading the data from disk. One option is to 'memoize' data. A simple example is

> library(memoise)
> f = function(i) { Sys.sleep(i); i }
> fm = memoise(f)
> system.time(fm(1))
   user  system elapsed 
  0.004   0.000   1.002 
> system.time(fm(1))
   user  system elapsed 
  0.024   0.000   0.028 
> system.time(fm(2))
   user  system elapsed 
  0.000   0.000   2.005 
> system.time(fm(2))
   user  system elapsed 
      0       0       0 

The idea would be to write an (internal) helper function such as

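.hub <- ExperimentHub::ExperimentHub()

.helper <- memoise::memoise(function(ehid) {
    ## fetch the resource from the hub; repeat calls with the same id hit the cache
    .hub[[ehid]]
})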

This loads data on first use, so not all users would pay the cost of loading data. One would want to take additional precautions if this were to be used in a parallel evaluation context.
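
To illustrate how nested functions in a package might share one memoised helper, here is a minimal sketch. It assumes the memoise package is installed. The names `.slow_load()`, `plot_region()` and `plot_all()` are hypothetical, and `.slow_load()` is only a stand-in for the hub lookup so that the example runs without ExperimentHub.

```r
library(memoise)

# Counter used only to show how often the expensive load really runs.
.n_loads <- 0L

# Hypothetical stand-in for the hub lookup; a real package would query
# ExperimentHub here instead of building a toy data.frame.
.slow_load <- function(ehid) {
    .n_loads <<- .n_loads + 1L
    Sys.sleep(0.1)
    data.frame(id = ehid, value = seq_len(3))
}

# Memoised wrapper: the first call per ehid does the work, later calls
# with the same ehid return the cached result.
.helper <- memoise(.slow_load)

# Two hypothetical package functions, one nested in the other, that both
# need the same resource.
plot_region <- function(ehid) {
    nrow(.helper(ehid))
}
plot_all <- function(ehid) {
    dat <- .helper(ehid)
    plot_region(ehid) + nrow(dat)
}

result <- plot_all("EH123")
```

Although `.helper()` is called twice inside `plot_all("EH123")`, `.n_loads` ends up at 1, so the expensive step runs once. If the cached values ever need to be dropped, `memoise::forget(.helper)` clears them.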

There are likely other approaches, e.g., using an .onLoad() function to load data

.cache <- new.env(parent=emptyenv())
.onLoad <- function(...) {
    ## runs once when the package namespace is loaded
    hub <- ExperimentHub::ExperimentHub()
    .cache[["EH123"]] <- hub[["EH123"]]
}

And then reference .cache[["EH123"]] in your code.
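
A variation on the `.cache` idea, offered as an assumption rather than as part of the suggestion above, fills the environment lazily through an accessor instead of in `.onLoad()`, so the data are fetched only when a function first needs them. `.get_resource()` and its `loader` argument are hypothetical names. In a package the loader would presumably wrap the ExperimentHub lookup; here a toy function takes its place so the example runs on its own.

```r
# Package-level environment holding resources that were already fetched.
.cache <- new.env(parent = emptyenv())

# Hypothetical accessor: return the cached object for `ehid`, calling
# `loader` only if nothing is stored under that key yet.
.get_resource <- function(ehid, loader) {
    if (!exists(ehid, envir = .cache, inherits = FALSE)) {
        assign(ehid, loader(ehid), envir = .cache)
    }
    get(ehid, envir = .cache, inherits = FALSE)
}

# Toy loader standing in for the ExperimentHub lookup; it counts its calls.
n_calls <- 0L
toy_loader <- function(ehid) {
    n_calls <<- n_calls + 1L
    paste("data for", ehid)
}

first  <- .get_resource("EH123", toy_loader)
second <- .get_resource("EH123", toy_loader)
```

Both calls return the same object, and `toy_loader()` runs only for the first one. Compared with loading in `.onLoad()`, this keeps `library(AnalysisPackage)` fast for users who never touch the large data.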

Questions about package development are better addressed to the bioc-devel mailing list.


modified 8 weeks ago • written 8 weeks ago by Martin Morgan

`memoise` is super neat, thanks Martin.

written 8 weeks ago by Levi Waldron

Yes, memoise is super neat!

modified 8 weeks ago • written 8 weeks ago by Lucas Schiffer

Thank you so much Martin, this is a perfect answer! Also, I apologize for posting to the wrong list. I'll be sure to post to bioc-devel next time.

written 8 weeks ago by slinker