Best practices for sharing RMarkdown-based supplemental PDFs?
Keith Hughitt ▴ 180
@keith-hughitt-6740

Greetings!

As I go through and clean up some RMarkdown-based analyses in preparation for inclusion as supplemental PDFs, I thought it might be interesting to see what kinds of things people consider "best practices" for generating these documents in such a way as to make them maximally useful and reusable.

To get started, here are some of the things I think about as I am going through this process:

Code

Text

  • Include at least a short textual description before each major section of the output
  • Spellcheck
  • Include any relevant references (knitcitations is useful here)

Tables

  • For tables that are small enough to be included directly in the PDF, make them look nice:
  • For large tables generated in the file, perhaps save the output to CSV and print a preview or summary of the table.
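
As a rough illustration of the large-table suggestion, here is a minimal sketch in base R. The built-in `mtcars` dataset stands in for a real results table, and the file name `table_S1.csv` and the `tempdir()` location are made up for the example; `knitr::kable()` is used for the preview only if knitr happens to be installed.

```r
# Stand-in for a large results table; mtcars is a built-in dataset
# used here purely for illustration.
big_table <- mtcars

# Hypothetical output location; in a real analysis this would sit
# alongside the supplemental PDF rather than in a temp directory.
out_file <- file.path(tempdir(), "table_S1.csv")
write.csv(big_table, out_file, row.names = TRUE)

# Tell the reader how big the full table is and where it was saved.
cat(sprintf("Full table (%d rows x %d columns) saved to %s\n",
            nrow(big_table), ncol(big_table), basename(out_file)))

# Show only the first few rows in the PDF itself.
preview <- head(big_table, 5)
if (requireNamespace("knitr", quietly = TRUE)) {
  print(knitr::kable(preview))
} else {
  print(preview)
}
```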

Figures

Data availability

Reusability

  • Try cloning the GitHub repo containing your RMarkdown file(s) to a new machine, and see how much effort it takes to re-run the analyses.
  • Give a link to the repo to someone else and ask them to try to reproduce the results.
  • Use a dependency manager such as packrat, pacman, or conda
  • Print out your sessionInfo() (you can wrap the result in a call to pander::pander() or utils::toLatex() for cleaner output)
  • If the analysis has any external dependencies (Python, bioinformatics tools, etc.), include relevant versions for each of these as well.
  • Include a commit hash (e.g. sprintf("Git commit revision: %s", system("git rev-parse HEAD", intern=TRUE)))
  • If possible, add code to pull your data from whichever public repository you deposited it in, and load it as needed (e.g. for SRA-hosted data, one could use the SRAdb package)
  • Consider containerizing the analyses with Docker or Singularity.
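
The session-info, external-version, and commit-hash items above could be collected in a final chunk along these lines. This is only a sketch: `run_or_na()` is a hypothetical helper (not part of any package), and `python3` is just an example of an external dependency. The helper returns `NA` instead of failing when a tool is missing or the document is knit outside a git checkout.

```r
# Record the R session; utils::toLatex() gives a LaTeX-friendly rendering.
si <- sessionInfo()
print(si)
latex_lines <- utils::toLatex(si)

# Hypothetical helper: run a command and return its first line of output,
# or NA if the command is missing or exits with a non-zero status.
run_or_na <- function(cmd, args) {
  out <- tryCatch(
    suppressWarnings(system2(cmd, args, stdout = TRUE, stderr = FALSE)),
    error = function(e) character(0)
  )
  ok <- length(out) >= 1 && is.null(attr(out, "status"))
  if (ok) out[1] else NA_character_
}

# Commit hash of the repository the document was knit from.
cat(sprintf("Git commit revision: %s\n", run_or_na("git", c("rev-parse", "HEAD"))))

# Version of an external dependency (python3 is only an example).
cat(sprintf("Python version: %s\n", run_or_na("python3", "--version")))
```

Returning `NA` rather than stopping is a design choice for the sketch; a stricter document might prefer to fail when the commit hash cannot be determined.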

Other

  • Remove unused parts of the document (commented out code, etc.)

Obviously, a lot of these same ideas apply equally well to any notebook / literate programming-based approach to document generation (Jupyter, Sweave, etc.)

What other kinds of things do you think should be considered?

Aaron Lun ★ 28k
@alun

I can't resist the temptation to point out Bioconductor's workflow packages:

https://www.bioconductor.org/packages/release/BiocViews.html#_Workflow

These are continuously compiled with the latest versions of all the packages, thus ensuring that the code can always be executed on an independent system. This requires not only that the analysis code be up to date with respect to package functionality, but also that the data sources be publicly available and accessible.

A few of my workflows even have internal checks to ensure that critical results are always the same. Otherwise, the code will crash to indicate that human intervention is required to re-interpret the results. This is better than the alternative, which is a wrong result (and embarrassment when the plot doesn't match the text).
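
To give a flavour of what such an internal check might look like, here is a minimal sketch. It is not taken from any of the workflows; `check_result()` is a hypothetical helper, and the `mtcars` regression merely stands in for a real analysis whose accompanying text claims that fuel efficiency decreases with weight.

```r
# Hypothetical helper: stop compilation with an informative message
# when a result that the text relies on no longer holds.
check_result <- function(condition, message) {
  if (!isTRUE(condition)) {
    stop("Result check failed: ", message, call. = FALSE)
  }
  invisible(TRUE)
}

# Stand-in analysis on a built-in dataset.
fit <- lm(mpg ~ wt, data = mtcars)
slope <- unname(coef(fit)["wt"])

# The (imagined) text says heavier cars are less efficient, so the
# document should fail to compile if the slope stops being negative.
check_result(slope < 0, "slope of mpg on wt is no longer negative; re-read the text")
```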

Of course, I'm not suggesting that one does this for every supplementary materials section, because - let's face it - supplementaries are pretty boring. But you can borrow these ideas to wrap RMarkdown files in your own workflow packages and set up continuous integration, e.g., on GitHub (provided you can fit in under the time limit).

On the data side, BiocFileCache is great for dealing with publicly available resources in my workflows. And if the data set is interesting or generally useful enough, it's worth thinking about putting it onto ExperimentHub for others to use.
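
For anyone who has not used BiocFileCache, a minimal sketch of the caching pattern is below. The wrapper `fetch_cached()`, the cache directory, and the URL in the usage comment are all made up for illustration; only `BiocFileCache()` and `bfcrpath()` come from the package itself.

```r
# Hypothetical wrapper: return a local path for a remote resource,
# downloading it only the first time it is requested.
fetch_cached <- function(url, cache_dir = file.path(tempdir(), "supplement_cache")) {
  if (!requireNamespace("BiocFileCache", quietly = TRUE)) {
    stop("The BiocFileCache package is required for this step.")
  }
  bfc <- BiocFileCache::BiocFileCache(cache_dir, ask = FALSE)
  # bfcrpath() looks the resource up in the cache and adds it if absent.
  unname(BiocFileCache::bfcrpath(bfc, url))
}

# Usage (not run; the URL is a placeholder, not a real data set):
# counts_path <- fetch_cached("https://example.org/counts.csv")
# counts <- read.csv(counts_path)
```

A persistent cache directory (rather than `tempdir()`) would be the more natural choice in a real document, so that recompiling does not download the data again.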
