Scientific Reproducibility in Biomedical Research: Provenance Metadata Ontology for Semantic Annotation of Study Description

AMIA Annu Symp Proc. 2017 Feb 10;2016:1070-1079. eCollection 2016.


Scientific reproducibility is key to scientific progress as it allows the research community to build on validated results, protect patients from potentially harmful trial drugs derived from incorrect results, and reduce waste of valuable resources. The National Institutes of Health (NIH) recently published a systematic guideline titled "Rigor and Reproducibility" for supporting reproducible research studies, which has also been accepted by several scientific journals. These journals will require published articles to conform to these new guidelines. Provenance metadata describes the history or origin of data, and it has long been used in computer science to ensure data quality and support scientific reproducibility. In this paper, we describe the development of the Provenance for Clinical and healthcare Research (ProvCaRe) framework together with a provenance ontology to support scientific reproducibility by formally modeling a core set of data elements representing details of a research study. We extend the PROV Ontology (PROV-O), which has been recommended as the provenance representation model by the World Wide Web Consortium (W3C), to represent both (a) data provenance and (b) process provenance. We use 124 study variables from 6 clinical research studies from the National Sleep Research Resource (NSRR) to evaluate the coverage of the provenance ontology. NSRR is the largest repository of NIH-funded sleep datasets with 50,000 studies from 36,000 participants. The provenance ontology reuses ontology concepts from existing biomedical ontologies, for example the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT), to model the provenance information of research studies. The ProvCaRe framework is being developed as part of the Big Data to Knowledge (BD2K) data provenance project.
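
The distinction between data provenance and process provenance in the abstract above can be illustrated with a minimal sketch. It is not the ProvCaRe ontology itself: the `prov:` terms (`Entity`, `Activity`, `Agent`, `wasDerivedFrom`, `used`, `wasGeneratedBy`, `wasAssociatedWith`) come from W3C PROV-O, while the `http://example.org/study#` namespace, the study items (`sleep_recording`, `ahi_value`, `scoring`, `technician`), and the helper functions are hypothetical names chosen only for illustration.

```python
# Minimal sketch of PROV-O style statements as plain (subject, predicate, object)
# triples. Only the prov: terms are real PROV-O vocabulary; everything in the
# EX namespace is a hypothetical example, not part of ProvCaRe.
PROV = "http://www.w3.org/ns/prov#"
EX = "http://example.org/study#"  # hypothetical namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def prov(term):
    """Build the full IRI of a PROV-O term."""
    return PROV + term


def ex(term):
    """Build the full IRI of a hypothetical study item."""
    return EX + term


triples = {
    # Typing of the hypothetical study items
    (ex("sleep_recording"), RDF_TYPE, prov("Entity")),
    (ex("ahi_value"), RDF_TYPE, prov("Entity")),
    (ex("scoring"), RDF_TYPE, prov("Activity")),
    (ex("technician"), RDF_TYPE, prov("Agent")),
    # Data provenance: which data a value was derived from
    (ex("ahi_value"), prov("wasDerivedFrom"), ex("sleep_recording")),
    # Process provenance: which step produced the value, and who performed it
    (ex("scoring"), prov("used"), ex("sleep_recording")),
    (ex("ahi_value"), prov("wasGeneratedBy"), ex("scoring")),
    (ex("scoring"), prov("wasAssociatedWith"), ex("technician")),
}


def objects(subject, predicate):
    """Return all objects of triples matching the subject and predicate."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


def lineage(entity):
    """Follow prov:wasDerivedFrom links back to every upstream source."""
    seen = []
    frontier = [entity]
    while frontier:
        current = frontier.pop()
        for source in objects(current, prov("wasDerivedFrom")):
            if source not in seen:
                seen.append(source)
                frontier.append(source)
    return seen


if __name__ == "__main__":
    print("Derived from:", lineage(ex("ahi_value")))
    print("Generated by:", objects(ex("ahi_value"), prov("wasGeneratedBy")))
```

In this sketch, `lineage` answers a data provenance question (where did a value come from), while querying `wasGeneratedBy` and `wasAssociatedWith` answers a process provenance question (which step produced it and who carried it out).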

MeSH terms

  • Biological Ontologies*
  • Biomedical Research / standards*
  • Databases as Topic
  • Humans
  • Metadata* / standards
  • National Institutes of Health (U.S.)
  • Reproducibility of Results*
  • Semantics
  • Sleep
  • Sleep Wake Disorders
  • United States