Democratizing Data Science Through Data Science Training

Pac Symp Biocomput. 2018;23:292-303.

Abstract

The biomedical sciences have experienced an explosion of data that threatens to overwhelm many current practitioners. Without easy access to data science training resources, biomedical researchers may find themselves unable to wrangle their own datasets. In 2014, to address the challenges posed by such a data onslaught, the National Institutes of Health (NIH) launched the Big Data to Knowledge (BD2K) initiative. To this end, the BD2K Training Coordinating Center (TCC; bigdatau.org) was funded to facilitate both in-person and online learning, and to open up the concepts of data science to the widest possible audience. Here, we describe the activities of the BD2K TCC and its focus on the construction of the Educational Resource Discovery Index (ERuDIte), which identifies, collects, describes, and organizes online data science materials from BD2K awardees, open online courses, and videos from scientific lectures and tutorials. ERuDIte now indexes over 9,500 resources. Given the richness of online training materials and the constant evolution of biomedical data science, computational methods applying information retrieval, natural language processing, and machine learning techniques are required - in effect, using data science to inform training in data science. In so doing, the TCC seeks to democratize novel insights and discoveries brought forth via large-scale data science training.

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Computational Biology / education*
  • Computational Biology / standards
  • Data Mining
  • Education, Distance / methods
  • Humans
  • Information Storage and Retrieval
  • Internet
  • Machine Learning
  • Metadata / standards
  • National Institutes of Health (U.S.)
  • Natural Language Processing
  • United States