The Essential Toolbox of Data Science: Python, R, Git, and Docker

Methods Mol Biol. 2020;2104:265-311. doi: 10.1007/978-1-0716-0239-3_15.


The daily work of data science relies on a set of essential tools: the programming languages Python and R, the version control tool Git, and the virtualization tool Docker. Proficiency in at least one programming language is required for data science. R is tied to a computing environment that focuses on statistics, in which many new algorithms in genomics and biomedicine are first published. Python has its roots in system administration and is a superb language for general programming. Version control is critical to managing complex projects, even when software development is not involved. Docker containers are becoming a key tool for deployment, portability, and reproducibility. This chapter provides a self-contained practical guide to these topics that readers can use as a reference and to plan their training.

Keywords: Bioinformatics; Data science; Docker; Git; Python; R; Version control; Virtualization.

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Computational Biology / methods*
  • Data Science* / methods
  • Database Management Systems
  • Programming Languages
  • Software*
  • User-Computer Interface
  • Web Browser