Greetings!
As I'm going through and cleaning up some RMarkdown-based analyses in preparation for inclusion as supplemental PDFs, I thought it might be interesting to see what kinds of things people consider "best practices" for generating these documents in a way that makes them maximally useful / reusable?
To get started, here are some of the things I think about as I am going through this process:
Code
- General techniques
- Apply general coding best practices.
- Adopt a coding style guide
- Comment your code.
- Have someone else review your code
- Create a Github repo to go along with the manuscript and upload your Rmd code, etc. there.
- Include a link to the Github repo in the document itself.
- A few specific ones that I often come across in published code:
- Get rid of any hard-coded filepaths (e.g. `input_dir <- '/home/user/Dropbox/foo'`; see the short sketch after this list)
- Use human-readable variable names (`min_read_cutoff` instead of `n`)
- Comment your code. (Included again for emphasis...)
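For instance, a minimal sketch of the filepath and naming points above (the directory and file names are just illustrative placeholders):

```r
# Build input paths relative to the project root instead of hard-coding a
# user-specific absolute path (directory and file names are placeholders)
input_dir   <- file.path("data", "raw")
counts_path <- file.path(input_dir, "read_counts.csv")
counts      <- read.csv(counts_path)

# Human-readable names instead of single letters like 'n'
min_read_cutoff <- 10
```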
Text
- Include at least a short textual description before each major section of the output
- Spellcheck
- Include any relevant references (the knitcitations package is useful here; see the sketch below)
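A rough sketch of how knitcitations can be used (assuming the package is installed; the DOI below is only a placeholder):

```r
library(knitcitations)
cleanbib()  # start with an empty bibliography for this document

# Cite inline by DOI; knitcitations resolves the reference metadata
# (the DOI here is a placeholder, not a real reference)
citep("10.1000/xyz123")

# At the end of the document, print the collected references
bibliography()
```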
Tables
- For tables that are small enough to be included directly in the PDF, make them look nice (e.g. with `knitr::kable()` or `pander`).
- For large tables generated in the file, consider saving the full output to CSV and printing only a preview or summary of the table (see the sketch below).
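For example, a sketch of handling both cases (`small_summary_df` and `results_df` are placeholders for whatever tables the analysis produces):

```r
# Small table: render directly in the PDF with a caption
knitr::kable(small_summary_df, caption = "Summary statistics per sample")

# Large table: save the full output alongside the document and show a preview
write.csv(results_df, "results/full_results.csv", row.names = FALSE)
knitr::kable(head(results_df, 10),
             caption = "Preview of the first 10 rows (full table in full_results.csv)")
```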
Figures
- Render figures at a high DPI (e.g. via knitr chunk options; see the sketch after this list)
- Label axes
- Make sure all text is legible
- Use a colorblind-friendly palette.
- Some more figure recommendations here and here
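For example (a sketch assuming ggplot2 is used; the data frame and column names are placeholders):

```r
# In the setup chunk: bump the default resolution and size for all figures
knitr::opts_chunk$set(dpi = 300, fig.width = 7, fig.height = 5)

library(ggplot2)

# Labeled axes, legible text, and a colorblind-friendly (viridis) palette
ggplot(expr_df, aes(x = condition, y = expression, fill = condition)) +
  geom_boxplot() +
  scale_fill_viridis_d() +
  labs(x = "Condition", y = "Normalized expression") +
  theme_bw(base_size = 14)
```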
Data availability
- Make sure all data used in the analyses are publicly accessible!
- Some possible places to share data:
- Sequence Read Archive (SRA)
- European Genome-phenome Archive (EGA)
- ArrayExpress
- Bioconductor ExperimentData Packages
- FigShare
- Some further suggestions here.
- Don't just share on your home/lab website! (There is a ~90% chance it will be gone within five years...)
Reusability
- Try cloning the Github repo containing your RMarkdown file(s) to a new machine, and see how much effort it takes to re-run the analyses.
- Give a link to the repo to someone else and ask them to try to reproduce the results.
- Use a dependency manager such as packrat, pacman, or conda
- Print out your `sessionInfo()` (you can wrap the result in a call to `pander::pander()` or `utils::toLatex()` for cleaner output; see the sketch after this list)
- If the analysis has any external dependencies (Python, bioinformatics tools, etc.), include relevant versions for each of these as well.
- Include a commit hash (e.g. `sprintf("Git commit revision: %s", system("git rev-parse HEAD", intern=TRUE))`)
- If possible, add code to pull your data from whichever public repository you deposited it at, and load it as-needed (e.g. for SRA-hosted data, one could use the SRAdb package)
- Consider containerizing the analyses with Docker or Singularity.
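Putting a couple of these together, the end of the document might contain something along these lines (a sketch; it assumes the .Rmd is knit from inside the Git repository and that pander is installed):

```r
# Record the exact Git revision the document was generated from
cat(sprintf("Git commit revision: %s",
            system("git rev-parse HEAD", intern = TRUE)))

# Session information (R version, platform, package versions),
# rendered more cleanly than the default print method
pander::pander(sessionInfo())
```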
Other
- Remove unused parts of the document (commented out code, etc.)
Obviously, a lot of these same ideas apply equally well to any notebook / literate programming-based approach to document generation (Jupyter, Sweave, etc.)
What other kinds of things do you think should be considered?