Language-Agnostic Reproducible Data Analysis Using Literate Programming

PLoS One. 2016 Oct 6;11(10):e0164023. doi: 10.1371/journal.pone.0164023. eCollection 2016.

Abstract

A modern biomedical research project can easily contain hundreds of analysis steps, and the lack of reproducibility of these analyses has been recognized as a severe issue. While thorough documentation enables reproducibility, the number of analysis programs used can be so large that, in practice, reproducibility cannot be easily achieved. Literate programming is an approach to presenting computer programs to human readers. The code is rearranged to follow the logic of the program, and that logic is explained in a natural language. The code executed by the computer is extracted from the literate source code. As such, literate programming is an ideal formalism for systematizing analysis steps in biomedical research. We have developed the reproducible computing tool Lir (literate, reproducible computing), which allows a tool-agnostic approach to biomedical data analysis. We demonstrate the utility of Lir by applying it to a case study. Our aim was to investigate the role of endosomal trafficking regulators in the progression of breast cancer. In this analysis, a variety of tools were combined to interpret the available data: a relational database, standard command-line tools, and a statistical computing environment. The analysis revealed that the lipid-transport-related genes LAPTM4B and NDRG1 are coamplified in breast cancer patients, and identified genes potentially cooperating with LAPTM4B in breast cancer progression. Our case study demonstrates that with Lir, an array of tools can be combined in the same data analysis to improve efficiency, reproducibility, and ease of understanding. Lir is open-source software available at github.com/borisvassilev/lir.
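
The extraction step described in the abstract (often called "tangling") can be illustrated with a minimal sketch. It does not show how Lir works and does not use Lir's syntax. It assumes a simplified noweb-like convention in which `<<name>>=` opens a code chunk, a line containing only `@` closes it, and `<<name>>` inside a chunk refers to another chunk. The function names `parse_chunks` and `tangle` and the sample document are hypothetical.

```python
import re

# Assumed, simplified noweb-like markers (not Lir's actual syntax).
CHUNK_START = re.compile(r"^<<(.+?)>>=\s*$")      # opens a named code chunk
CHUNK_REF = re.compile(r"^(\s*)<<(.+?)>>\s*$")    # refers to another chunk


def parse_chunks(literate_source):
    """Collect the code lines of every named chunk; prose is skipped."""
    chunks = {}
    current = None
    for line in literate_source.splitlines():
        start = CHUNK_START.match(line)
        if start:
            current = start.group(1)
            chunks.setdefault(current, [])
        elif line.strip() == "@":
            current = None          # back to prose
        elif current is not None:
            chunks[current].append(line)
    return chunks


def tangle(chunks, name):
    """Expand chunk references recursively into executable code lines."""
    out = []
    for line in chunks[name]:       # KeyError if a chunk is undefined
        ref = CHUNK_REF.match(line)
        if ref:
            indent, target = ref.groups()
            out.extend(indent + inner for inner in tangle(chunks, target))
        else:
            out.append(line)
    return out


# Hypothetical literate document: prose explains, chunks hold the code.
DOC = """\
First we load the measurements, then we summarise them.

<<analysis.py>>=
<<load data>>
<<summarise>>
@

Loading is trivial here: the values are written inline.

<<load data>>=
values = [2, 4, 6]
@

The summary is the arithmetic mean.

<<summarise>>=
mean = sum(values) / len(values)
@
"""

if __name__ == "__main__":
    print("\n".join(tangle(parse_chunks(DOC), "analysis.py")))
```

The prose order in the sample follows the explanation, while the order of the extracted code is set by the root chunk `analysis.py`. This is the separation between presentation and execution that the abstract describes.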

MeSH terms

  • Biological Transport
  • Breast Neoplasms / genetics
  • Breast Neoplasms / pathology
  • Cell Cycle Proteins / genetics
  • Computational Biology / methods*
  • Endosomes / metabolism
  • Humans
  • Intracellular Signaling Peptides and Proteins / genetics
  • Lipid Metabolism
  • Membrane Proteins / genetics
  • Oncogene Proteins / genetics
  • RNA, Messenger / genetics
  • RNA, Messenger / metabolism
  • Software*

Substances

  • Cell Cycle Proteins
  • Intracellular Signaling Peptides and Proteins
  • LAPTM4B protein, human
  • Membrane Proteins
  • N-myc downstream-regulated gene 1 protein
  • Oncogene Proteins
  • RNA, Messenger

Grants and funding

BV received personal grants from the Finnish Society of Science and Letters (http://www.scientiarum.fi/eng/); the Biomedicum Helsinki Foundation (http://www.biomedicum.com/index.php?page=112&lang=2); the Paulon Säätiö (http://www.paulo.fi/); the K. Albin Johanssons Stiftelse (http://www.foundationweb.net/johansson/); and the Ida Montinin Säätiö (www.idamontininsaatio.fi/). RL received no specific funding for this work. EI received grant 282192 from the Academy of Finland (www.aka.fi). SH received funding from Biocentrum Helsinki (http://www.helsinki.fi/biocentrum/). The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.