Question

Best practices for sharing RMarkdown-based supplemental PDFs?

Keith Hughitt ▴ 180

@keith-hughitt-6740

United States

Greetings!

As I go through and clean up some RMarkdown-based analyses in preparation for inclusion as supplemental PDFs, I thought it would be interesting to see what people consider "best practices" for generating these documents in a way that makes them maximally useful and reusable.

To get started, here are some of the things I think about as I am going through this process:

Code

General techniques
- Apply coding best practices.
- Adopt a coding style guide.
- Comment your code.
- Have someone else review your code.
- Create a GitHub repo to go along with the manuscript and upload your Rmd code, etc. there.
- Include a link to the GitHub repo in the document itself.
A few specific ones that I often come across in published code:
- Get rid of any hard-coded filepaths (e.g. input_dir <- '/home/user/Dropbox/foo')
- Use human-readable variable names (min_read_cutoff instead of n)
- Comment your code. (Including again for emphasis...)
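
As a rough illustration of the last few points, here is a minimal sketch in base R. The environment variable name ANALYSIS_INPUT_DIR, the file name counts.csv, the helper filter_low_counts() and the toy matrix are all made up for this example.

```r
# Hypothetical setup: read the input directory from an environment variable
# (ANALYSIS_INPUT_DIR is a made-up name) and fall back to a path relative to
# the project root, instead of an absolute, user-specific path.
input_dir <- Sys.getenv("ANALYSIS_INPUT_DIR", unset = file.path("data", "raw"))

# Build paths with file.path() so they work across operating systems.
# "counts.csv" is a placeholder file name.
counts_file <- file.path(input_dir, "counts.csv")

# A descriptive name documents intent better than a bare `n`.
min_read_cutoff <- 10

# Keep only the rows (genes) whose total read count meets the cutoff.
filter_low_counts <- function(count_matrix, min_reads) {
  count_matrix[rowSums(count_matrix) >= min_reads, , drop = FALSE]
}

# Toy data standing in for a real count matrix.
example_counts <- matrix(
  c( 0,  1,  2,
    20, 30, 40,
     5,  5,  5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("gene_a", "gene_b", "gene_c"), c("s1", "s2", "s3"))
)

filtered_counts <- filter_low_counts(example_counts, min_read_cutoff)
print(filtered_counts)
```

Anyone cloning the repo can then point the analysis at their own copy of the data by setting one variable, without editing the Rmd file.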

Text

Include at least a short textual description before each major section of the output
Spellcheck
Include any relevant references (knitcitations is useful here)

Tables

For tables that are small enough to be included directly in the PDF, make them look nice:
- kable
- kableExtra
- xtable
For large tables generated in the file, perhaps save the output to CSV and print a preview or summary of the table.
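
One way the large-table suggestion might look in practice is sketched below. The data frame, its column names and the output file name are invented for illustration, and knitr::kable() is only used if knitr happens to be installed.

```r
# Toy data standing in for a large results table (column names are made up).
set.seed(1)
results <- data.frame(
  gene_id = sprintf("gene_%04d", 1:5000),
  log_fold_change = rnorm(5000),
  p_value = runif(5000)
)

# Write the full table to disk. A temporary directory is used here so the
# sketch has no side effects; in practice this would be an output directory
# that is shared alongside the PDF.
output_file <- file.path(tempdir(), "differential_expression_results.csv")
write.csv(results, output_file, row.names = FALSE)

# Show only a short preview in the document itself.
preview <- head(results[order(results$p_value), ], n = 10)

# Use knitr::kable() when knitr is available, otherwise fall back to print().
if (requireNamespace("knitr", quietly = TRUE)) {
  print(knitr::kable(preview, digits = 3, row.names = FALSE,
                     caption = "Top 10 rows by p-value (preview)"))
} else {
  print(preview)
}

# A one-line summary tells the reader how big the full table is.
cat(sprintf("Full table: %d rows x %d columns, saved to %s\n",
            nrow(results), ncol(results), basename(output_file)))
```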

Figures

Render figures at a high DPI
Label axes
Make sure all text is legible
Use a colorblind friendly palette.
Some more figure recommendations here and here
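
A possible starting point for the figure settings is sketched here. It assumes R 4.0.0 or later (for the built-in Okabe-Ito palette). The data frame, the axis labels and the helper plot_expression_by_group() are placeholders.

```r
# Okabe-Ito is a widely used colorblind-friendly palette; it ships with
# base R (grDevices) from version 4.0.0 onward.
okabe_ito <- unname(grDevices::palette.colors(n = 8, palette = "Okabe-Ito"))

# When knitting, ask for 300 DPI and a fixed figure size for every chunk.
# Guarded so the sketch also runs where knitr is not installed.
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::opts_chunk$set(dpi = 300, fig.width = 6, fig.height = 4)
}

# Toy data (group names are placeholders).
set.seed(1)
expression <- data.frame(
  group = factor(rep(c("control", "treated"), each = 50)),
  value = c(rnorm(50, mean = 5), rnorm(50, mean = 6))
)

# Hypothetical plotting helper: labelled axes, slightly enlarged text,
# and colors taken from the colorblind-friendly palette.
plot_expression_by_group <- function(expression_data) {
  boxplot(value ~ group, data = expression_data,
          col = okabe_ito[2:3],
          xlab = "Sample group",
          ylab = "Expression (arbitrary units)",
          cex.lab = 1.2, cex.axis = 1.1)
}
```

Calling plot_expression_by_group(expression) inside a chunk would then render the figure with those settings.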

Data availability

Make sure all data used in the analyses are publicly accessible!
Some possible places to share data:
- Sequence Read Archive (SRA)
- European Genome-phenome Archive (EGA)
- ArrayExpress
- Bioconductor ExperimentData Packages
- FigShare
- Some further suggestions here.
Don't just share on your home/lab website! (There is a ~90% chance it will be gone within five years.)

Reusability

Try cloning the GitHub repo containing your RMarkdown file(s) to a new machine, and see how much effort it takes to re-run the analyses.
Give a link to the repo to someone else and ask them to try to reproduce the results.
Use a dependency manager such as packrat, pacman, or conda
Print out your sessionInfo() (you can wrap the result in a call to pander::pander() or utils::toLatex() for cleaner output)
If the analysis has any external dependencies (Python, bioinformatics tools, etc.), include relevant versions for each of these as well.
Include a commit hash (e.g. sprintf("Git commit revision: %s", system("git rev-parse HEAD", intern=TRUE)))
If possible, add code to pull your data from whichever public repository you deposited it in, and load it as needed (e.g. for SRA-hosted data, one could use the SRAdb package)
Consider containerizing the analyses with Docker or Singularity.
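
The version-reporting items above could be gathered into a final chunk along these lines. This is only a sketch: get_git_commit() and get_tool_version() are hypothetical helpers, and the "--version" flag is an assumption that does not hold for every tool.

```r
# Return the current Git commit hash, or NA if git is unavailable or the
# working directory is not inside a repository.
get_git_commit <- function() {
  hash <- tryCatch(
    suppressWarnings(
      system2("git", c("rev-parse", "HEAD"), stdout = TRUE, stderr = FALSE)
    ),
    error = function(e) character(0)
  )
  status <- attr(hash, "status")
  if (length(hash) != 1 || (!is.null(status) && status != 0)) {
    return(NA_character_)
  }
  as.character(hash)
}

# Return the first line printed by `<command> --version`, or NA if the
# command is not on the PATH. Not every tool understands "--version".
get_tool_version <- function(command, args = "--version") {
  if (!nzchar(Sys.which(command))) {
    return(NA_character_)
  }
  output <- tryCatch(
    suppressWarnings(system2(command, args, stdout = TRUE, stderr = TRUE)),
    error = function(e) character(0)
  )
  if (length(output) == 0) NA_character_ else output[1]
}

cat(sprintf("Git commit revision: %s\n", get_git_commit()))
cat(sprintf("git version: %s\n", get_tool_version("git")))

# R and package versions. For PDF output, utils::toLatex(session_details)
# in a chunk with results = "asis" gives a tidier listing.
session_details <- sessionInfo()
print(session_details)
```

Returning NA rather than failing means the document still knits from a tarball or on a machine without git, while making the missing information visible.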

Other

Remove unused parts of the document (commented out code, etc.)

Obviously, a lot of these same ideas apply equally well to any notebook / literate programming-based approach to document generation (Jupyter, Sweave, etc.)

What other kinds of things do you think should be considered?

rmarkdown knitr • 1.1k views

updated 5.3 years ago by Aaron Lun ★ 28k • written 5.3 years ago by Keith Hughitt ▴ 180

score 1 · Answer 1 · 2019-02-07

I can't resist the temptation to point out Bioconductor's workflow packages:

https://www.bioconductor.org/packages/release/BiocViews.html#_Workflow

These are continuously compiled with the latest versions of all the packages, thus ensuring that the code can always be executed on an independent system. This requires not only that the analysis code be up to date with respect to package functionality, but also that the data sources be publicly available and accessible.

A few of my workflows even have internal checks to ensure that critical results are always the same. If a result changes, the code will crash to indicate that human intervention is required to re-interpret the results. This is better than the alternative, which is a wrong result (and embarrassment when the plot doesn't match the text).
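
A check of this kind might look roughly like the sketch below. It is not taken from any actual workflow package; check_critical_result(), the toy p-values and the expected count are all invented for illustration.

```r
# Stop with an informative error when a key result drifts away from the
# value that the surrounding text was written around.
check_critical_result <- function(observed, expected, label,
                                  tolerance = 1e-8) {
  comparison <- all.equal(observed, expected, tolerance = tolerance)
  if (!isTRUE(comparison)) {
    stop(sprintf("Critical result '%s' changed (%s); re-check the text that interprets it.",
                 label, paste(comparison, collapse = "; ")),
         call. = FALSE)
  }
  invisible(observed)
}

# Toy example: the text of the report claims three significant genes.
p_values <- c(0.001, 0.2, 0.03, 0.8, 0.049, 0.5)
n_significant <- sum(p_values < 0.05)
check_critical_result(n_significant, 3, label = "number of significant genes")
```

If a package update changes the count, knitting fails at this line instead of quietly producing a PDF whose numbers contradict its prose.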

Of course, I'm not suggesting that one does this for every supplementary materials section, because - let's face it - supplementaries are pretty boring. But you can borrow these ideas to wrap RMarkdown files in your own workflow packages and set up continuous integration, e.g., on GitHub (provided you can fit in under the time limit).

On the data side, BiocFileCache is great for dealing with publicly available resources in my workflows. And if the data set is interesting or generally useful enough, it's worth thinking about putting it onto ExperimentHub for others to use.
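
To give a feel for the download-once-then-reuse idea, here is a small base R sketch. It is not the BiocFileCache API (which also tracks metadata and updates). fetch_cached(), the example URL and the stand-in downloader are all hypothetical.

```r
# Download `url` into `cache_dir` only if no cached copy exists yet, and
# return the local path. `fetch` can be swapped out, which makes it easy to
# try the logic offline.
fetch_cached <- function(url, cache_dir,
                         fetch = function(url, destfile) {
                           utils::download.file(url, destfile, mode = "wb")
                         }) {
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  local_path <- file.path(cache_dir, basename(url))
  if (!file.exists(local_path)) {
    fetch(url, local_path)
    if (!file.exists(local_path)) {
      stop("Download did not produce a file for: ", url, call. = FALSE)
    }
  }
  local_path
}

# Offline demonstration with a stand-in downloader that writes a tiny file.
# The URL is a placeholder and is never contacted here.
fake_fetch <- function(url, destfile) {
  writeLines(c("sample_id,condition", "s1,control"), destfile)
}
demo_cache_dir <- file.path(tempdir(), "data_cache_demo")
data_url <- "https://example.org/study/sample_metadata.csv"
metadata_path <- fetch_cached(data_url, demo_cache_dir, fetch = fake_fetch)
sample_metadata <- read.csv(metadata_path)
print(sample_metadata)
```

In a real analysis the default downloader would be used, and BiocFileCache is the more robust choice when the remote resource might change.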