enabling reproducible research & R package management & install.package.version & BiocLite
Malcolm Cook ★ 1.6k
@malcolm-cook-6293
Last seen 8 days ago
United States
Hi,

In support of reproducible research at my Institute, I seek an approach to re-creating the R environments in which an analysis has been conducted. By which I mean the exact version of R and the exact version of all packages used in a particular R session.

I am seeking comments/criticism of this as a goal, and of the following outline of an approach:

=== When all the steps of a workflow have been finalized ===

* re-run the workflow from beginning to end
* save the results of sessionInfo() into an RDS file named after the current date and time

=== Later, when desirous of exactly recreating this analysis ===

* read the (old) sessionInfo() into an R session
* exit with failure if the running version of R doesn't match
* compare the old sessionInfo to the currently installed libraries (i.e. using packageVersion)
* where there are discrepancies, install the required version of the package (without dependencies) into a new library (named after the old sessionInfo RDS file)

Then the analyst should be able to put the new library at the front of .libPaths and run the analysis, confident that the same versions of the packages are in use.

I have in the past used install-package-version.R to revert to previous versions of R packages successfully. And there is a similar tool in Hadley Wickham's devtools.

But I don't know if I need something special for (Bioconductor) packages that have been installed using biocLite, and seek advice here.

I do understand that the R environment alone is not sufficient to guarantee reproducibility. Some of my colleagues have suggested saving a virtual machine with all your software/libraries/data installed. So I am also, in general, interested in what other people are doing to this end.

But I am most interested in:

* is this a good idea
* is there a worked-out solution
* does biocLite introduce special cases
* where do the dragons lurk

... and the like.

Any tips?

Thanks,

~ Malcolm Cook
Stowers Institute / Computation Biology / Shilatifard Lab
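A minimal sketch of the snapshot-and-compare steps outlined in the post, assuming base R only. The helper names `snapshot_session`, `recorded_versions`, and `compare_session` are hypothetical, and the final step (installing the mismatched versions into a separate library) is deliberately left out.

```r
# Save sessionInfo() to an RDS file named after the current date and time.
# (Hypothetical helper; the file name pattern is an assumption.)
snapshot_session <- function(dir = ".") {
  file <- file.path(dir, format(Sys.time(), "sessionInfo-%Y%m%d-%H%M%S.rds"))
  saveRDS(sessionInfo(), file)
  invisible(file)
}

# Named character vector (package -> version) for every attached or
# namespace-loaded package recorded in a sessionInfo object.
recorded_versions <- function(info) {
  pkgs <- c(info$otherPkgs, info$loadedOnly)
  vapply(pkgs, function(d) as.character(package_version(d$Version)),
         character(1))
}

# Stop if the running R differs from the recorded one; otherwise return a
# data frame of packages whose installed version is missing or different.
compare_session <- function(file) {
  old <- readRDS(file)
  if (!identical(old$R.version$version.string, R.version$version.string)) {
    stop("R version mismatch; recorded: ", old$R.version$version.string)
  }
  want <- recorded_versions(old)
  pkgs <- as.character(names(want))
  have <- vapply(pkgs, function(p) {
    tryCatch(as.character(packageVersion(p)),
             error = function(e) NA_character_)  # package not installed
  }, character(1))
  differs <- is.na(have) | have != want
  data.frame(package = pkgs, recorded = unname(want),
             installed = unname(have),
             stringsAsFactors = FALSE)[differs, , drop = FALSE]
}
```

Calling `compare_session()` on an old snapshot returns one row per package that would need to be reinstalled at the recorded version; an empty data frame means the current library already matches.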
Aaron Mackey ▴ 170
@aaron-mackey-4358
Last seen 9.6 years ago
On Mon, Mar 4, 2013 at 4:13 PM, Cook, Malcolm <mec@stowers.org> wrote:

> * where do the dragons lurk

Webs of interconnected dynamically loaded libraries, identical versions of R compiled with different BLAS/LAPACK options, etc. Go with the VM if you really, truly, want this level of exact reproducibility.

An alternative (and arguably more useful) strategy would be to cache the results of each computational step, and report when results differ upon re-execution with identical inputs. If you cache sessionInfo along with each result, you can identify which package(s) changed, and begin to hunt down why the change occurred (possibly for the better). Couple this with the concept of keeping both code *and* results in version control, and you can move forward with a (re)analysis without being crippled by out-of-date software.

-Aaron

--
Aaron J. Mackey, PhD
Assistant Professor
Center for Public Health Genomics
University of Virginia
amackey@virginia.edu
http://www.cphg.virginia.edu/mackey
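The caching strategy described in this reply could look roughly like the following sketch. The function `run_cached` and the one-RDS-file-per-step cache layout are assumptions made for illustration, not an existing tool.

```r
# Run one workflow step and cache its result together with sessionInfo().
# On a later run with identical inputs, compare the new result with the
# cached one and warn if they differ. (Hypothetical helper.)
run_cached <- function(name, fun, inputs, cache_dir = "cache") {
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
  file <- file.path(cache_dir, paste0(name, ".rds"))
  result <- do.call(fun, inputs)
  if (file.exists(file)) {
    old <- readRDS(file)
    if (identical(old$inputs, inputs)) {
      # all.equal() tolerates tiny floating-point differences
      changed <- !isTRUE(all.equal(old$result, result))
      if (changed) {
        warning("step '", name, "' gave a different result; ",
                "compare the cached session with sessionInfo()")
      }
      # Keep the cached reference untouched so it can be inspected
      return(list(result = result, changed = changed,
                  cached_session = old$session))
    }
  }
  saveRDS(list(inputs = inputs, result = result, session = sessionInfo()),
          file)
  list(result = result, changed = FALSE, cached_session = NULL)
}
```

Using `all.equal()` rather than `identical()` for the results is a design choice: a different BLAS/LAPACK build can change the last few digits of a result without changing its meaning. Committing the cache directory alongside the code gives the "code and results in version control" arrangement mentioned above.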
On Mon, Mar 4, 2013 at 4:28 PM, Aaron Mackey <amackey at virginia.edu> wrote:

> On Mon, Mar 4, 2013 at 4:13 PM, Cook, Malcolm <mec at stowers.org> wrote:
>
>> * where do the dragons lurk
>
> webs of interconnected dynamically loaded libraries, identical versions of
> R compiled with different BLAS/LAPACK options, etc. Go with the VM if you
> really, truly, want this level of exact reproducibility.

Sounds like the best bet -- maybe tools like vagrant might be useful here:

http://www.vagrantup.com

... or maybe they're overkill?

I haven't really checked it out myself too much; my impression is that these tools (vagrant, chef, puppet) are built to handle such cases.

I'd imagine you'd probably need a location where you can grab the precise (versioned) packages for the things you are specifying, but ...

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
On Mon, Mar 4, 2013 at 2:15 PM, Steve Lianoglou <mailinglist.honeypot at gmail.com> wrote:

> I'd imagine you'd probably need a location where you can grab the
> precise (versioned) packages for the things you are specifying, but

Right...and this is a bit tricky, because we don't keep old versions around in our BioC software repositories. They are available through Subversion, but sometimes with the additional overhead of setting up build-time dependencies.

Dan
> Right...and this is a bit tricky, because we don't keep old versions
> around in our BioC software repositories. They are available through
> Subversion but with the sometimes additional overhead of setting up
> build-time dependencies.

So, even if I wanted to go where dragons lurked, it would not be possible to cobble together a version of biocLite that installed specific versions of software.

Thus, I might rather consider an approach that, at 'publish' time, tars up a copy of the R package dependencies based on a config file defined from sessionInfo, and caches it in the project directory.

Then when/if the project is revisited (and found to produce different results under the current R environment), I can "simply" install an old R (oops, I guess I'd have to build it), and then untar the dependencies into the project's own R/Library, which I would put on .libPaths.

Or, or?

(My virtual machine advocating colleagues are snickering now, I am sure......)

Thanks for all your thoughts and advice....

--Malcolm
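The 'publish'-time archiving idea could be sketched as follows, assuming base R only. The helpers `session_packages` and `archive_packages` and the staging-directory approach are hypothetical. The sketch copies installed (already built) package directories, so the archive is only usable with a compatible R build on a compatible platform.

```r
# Names of the attached and namespace-loaded packages in a sessionInfo object.
session_packages <- function(info = sessionInfo()) {
  as.character(names(c(info$otherPkgs, info$loadedOnly)))
}

# Copy the installed directories of `pkgs` into a staging library and
# bundle that library into a gzip-compressed tarball. (Hypothetical helper.)
archive_packages <- function(pkgs, tarfile) {
  tarfile <- file.path(normalizePath(dirname(tarfile)), basename(tarfile))
  staging <- tempfile("library-")
  dir.create(staging)
  old_wd <- getwd()
  # Restore the working directory before removing the staging area
  on.exit({
    setwd(old_wd)
    unlink(staging, recursive = TRUE)
  })
  ok <- file.copy(find.package(pkgs), staging, recursive = TRUE)
  stopifnot(all(ok))
  setwd(staging)  # so that archive paths are relative, e.g. "Biobase/..."
  tar(tarfile, files = pkgs, compression = "gzip", tar = "internal")
  invisible(tarfile)
}
```

At publish time one might call `archive_packages(session_packages(), "R-library.tar.gz")` from the project directory; later, `untar()` into a project library and prepend that directory with `.libPaths()`.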
On Tue, 5 Mar 2013, Cook, Malcolm wrote:

> Thus, I might rather consider an approach that at 'publish' time tarzips
> up a copy of the R package dependencies based on a config file defined
> from sessionInfo and caches it in the project directory.

If you had a separate environment for every project, each with its own R installation and its own library (lib.loc), this becomes rather easy. For instance, something like this:

myProject/
    projectRInstallation/
        bin/
            R
        library/
            Biobase
            annotate
            .....
        ....
    projectData/
    projectCode/
    projectOutput/

The directory structure would likely be more complicated than that, but something along those lines. This way all code, data *and* compute environment are always linked together.

-J
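A lighter-weight variation on this layout, for when a full per-project R installation is not practical, is to give each project only its own library and put it first on the search path. The helper `use_project_library` is hypothetical; the directory names follow the layout sketched in the post.

```r
# Create (if needed) the project's own package library and put it at the
# front of the library search path. (Hypothetical helper.)
use_project_library <- function(project_dir) {
  lib <- file.path(project_dir, "projectRInstallation", "library")
  dir.create(lib, recursive = TRUE, showWarnings = FALSE)
  .libPaths(c(lib, .libPaths()))
  invisible(.libPaths())
}
```

After `use_project_library("myProject")`, `install.packages()` installs into the project library by default and `library()` looks there first. Unlike the full layout above, this does not pin the version of R itself.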