Question: Best practices for sharing RMarkdown-based supplemental PDFs?
10 weeks ago by Keith Hughitt120 (United States) wrote:

Greetings!

As I'm going through and cleaning up some RMarkdown-based analyses in preparation for inclusion as supplemental PDF's, I thought it might be interesting to see what kinds of things people consider "best practices" for generating these documents in such a way to make them maximally useful / reusable?

To get started, here are some of the things I think about as I am going through this process:

Code

• General techniques
• A few specific ones that I often come across in published code:
• Get rid of any hard-coded file paths (e.g. input_dir <- '/home/user/Dropbox/foo')
• Use human-readable variable names (min_read_cutoff instead of n)
• Comment your code. (Included again for emphasis...)
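The first two points can be sketched in a few lines of R; the directory, file name, and cutoff below are placeholders, not part of any real analysis:

```r
# Sketch: use a relative, parameterized path instead of a hard-coded
# absolute one, and descriptive variable names instead of single letters.
# 'data/raw' and 'read_counts.csv' are hypothetical examples.
input_dir <- file.path("data", "raw")   # relative to the project root

# Human-readable name instead of something like 'n'
min_read_cutoff <- 10  # drop genes with fewer than this many total reads

counts <- read.csv(file.path(input_dir, "read_counts.csv"), row.names = 1)
filtered_counts <- counts[rowSums(counts) >= min_read_cutoff, , drop = FALSE]
```

Defining paths and cutoffs once near the top of the document (or in the YAML params block) also makes it obvious what a reader needs to change to re-run the analysis on their own machine.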

Text

• Include at least a short textual description before each major section of the output
• Spellcheck
• Include any relevant references (knitcitations is useful here)
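A minimal knitcitations sketch, for anyone who hasn't used it (the DOI below is a placeholder):

```r
# Sketch of knitcitations usage; replace the DOI with a real one.
library(knitcitations)
cleanbib()  # start with an empty bibliography

# Cite inline by DOI; the reference is looked up and collected automatically
citep("10.1000/xyz123")

# At the end of the document, print the collected bibliography
bibliography()
```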

Tables

• For tables that are small enough to be included directly in the PDF, make them look nice (e.g. with knitr::kable or pander).
• For large tables generated in the file, save the full output to a CSV and print only a preview or summary of the table in the document.
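The large-table approach might look like this; 'deg_results' and the file name are hypothetical:

```r
# Sketch: write the full table to a CSV shipped alongside the PDF,
# and render only a small preview in the document itself.
library(knitr)

# 'deg_results' is a placeholder for a data frame computed earlier
write.csv(deg_results, "supp_table_S1_deg_results.csv", row.names = FALSE)

# Show just the first few rows in the rendered PDF
kable(head(deg_results, 10),
      caption = "First ten rows of Supplemental Table S1 (full table in CSV).")
```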

Figures

Data availability

• Make sure all data used in the analyses are publicly accessible!
• Some possible places to share data: general-purpose repositories such as Zenodo or figshare, or a domain-specific archive (e.g. GEO/SRA for sequencing data).
• Don't just share on your home/lab website! (There is a ~90% chance it will be gone within five years...)

Reusability

• Try cloning the GitHub repo containing your RMarkdown file(s) to a new machine, and see how much effort it takes to re-run the analyses.
• Give a link to the repo to someone else and ask them to try to reproduce the results.
• Use a dependency manager such as packrat, pacman, or conda
• Print out your sessionInfo() (you can wrap the result in a call to pander::pander() or utils::toLatex() for cleaner output)
• If the analysis has any external dependencies (Python, bioinformatics tools, etc.), include relevant versions for each of these as well.
• Include a commit hash (e.g. sprintf("Git commit revision: %s", system("git rev-parse HEAD", intern=TRUE)))
• If possible, add code to pull your data from whichever public repository you deposited it at, and load it as needed (e.g. for SRA-hosted data, one could use the SRAdb package)
• Consider containerizing the analyses with Docker or Singularity.
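The sessionInfo() and commit-hash points above can be combined into a short chunk placed at the end of the document:

```r
# Sketch: record provenance at the end of the compiled document.
library(pander)

# Git commit of the repository that produced this document
# (assumes the document is knitted from inside a git checkout)
sprintf("Git commit revision: %s",
        system("git rev-parse HEAD", intern = TRUE))

# Package and R versions, formatted nicely for the PDF
pander(sessionInfo())
```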

Other

• Remove unused parts of the document (commented out code, etc.)

Obviously, a lot of these same ideas apply equally well to any notebook- or literate-programming-based approach to document generation (Jupyter, Sweave, etc.)

What other kinds of things do you think should be considered?

Tags: knitr, rmarkdown
modified 10 weeks ago by Aaron Lun23k • written 10 weeks ago by Keith Hughitt120
Answer: Best practices for sharing RMarkdown-based supplemental PDFs?
10 weeks ago by Aaron Lun23k (Cambridge, United Kingdom) wrote:

I can't resist the temptation to point out Bioconductor's workflow packages.

These are continuously compiled with the latest versions of all the packages, ensuring that the code can always be executed on an independent system. This requires not only that the analysis code be kept up to date with respect to package functionality, but also that the data sources remain publicly available and accessible.

A few of my workflows even have internal checks to ensure that critical results are always the same. Otherwise, the code will crash to indicate that human intervention is required to re-interpret the results. This is better than the alternative, which is a silently wrong result (and embarrassment when the plot doesn't match the text).
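Such internal checks can be as simple as assertions on key numbers; here is a hedged sketch in which the object name and the expected value are both hypothetical:

```r
# Sketch: guard a critical result with an assertion so the compiled
# document fails loudly if the numbers drift between package versions.
# 'results' is a placeholder for a data frame computed earlier.
n_de_genes <- sum(results$adj.P.Val < 0.05)

# The text of the document claims 142 DE genes; crash if that changes.
stopifnot(n_de_genes == 142)
```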

Of course, I'm not suggesting that one does this for every supplementary materials section, because - let's face it - supplementaries are pretty boring. But you can borrow these ideas to wrap Rmarkdown files in your own workflow packages and set up continuous integration, e.g., on GitHub (provided you can fit in under the time limit).

On the data side, BiocFileCache is great for dealing with publicly available resources in my workflows. And if the data set is interesting or generally useful enough, it's worth thinking about putting it onto ExperimentHub for others to use.
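For anyone unfamiliar with BiocFileCache, the basic pattern is roughly this (the URL is a placeholder):

```r
# Sketch of BiocFileCache usage for a publicly hosted resource.
library(BiocFileCache)

bfc <- BiocFileCache(ask = FALSE)

# Downloads the file on first use; subsequent calls return the cached copy
path <- bfcrpath(bfc, "https://example.org/public/dataset.rds")
dat <- readRDS(path)
```

This keeps the document self-contained while avoiding a fresh download on every compilation.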