Automatic Detection of Metadata Errors in a Registry of Clinical Studies Using Shapes Constraint Language (SHACL) Graphs

Stud Health Technol Inform. 2021 May 27:281:372-376. doi: 10.3233/SHTI210183.

Abstract

Registries of clinical studies such as ClinicalTrials.gov are an important source of information. However, manual entry of metadata is prone to errors, which impede the use of the metadata and thereby reduce the overall usefulness of the registry. In this work, we propose a generic approach to detecting errors in the metadata by using the Shapes Constraint Language (SHACL) to define rule templates covering constraints on value type and cardinality. We developed a Python 3 algorithm for the automatic validation of 15 rule instances applied to the whole ClinicalTrials.gov database (355,862 studies; 27 October 2020), resulting in more than 5 million metadata verifications. Our results show that a large number of errors in different metadata fields, such as i) missing values, ii) values not coming from a predefined set, or iii) wrong cardinalities, can be detected using this approach. Since 2015, approximately 5% of all studies contain one or more errors. In the future, we will apply this technique to other registries and develop more complex rules by focusing on the semantics of the metadata. This could make it possible to correct entries automatically, increasing the value of registries of clinical studies.
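To illustrate the three error classes named in the abstract (missing values, values outside a predefined set, wrong cardinalities), the following is a minimal Python 3 sketch. It is not the authors' implementation and does not use an actual SHACL engine. It only mimics the idea behind the SHACL constraints sh:minCount, sh:maxCount, sh:in and sh:datatype on plain dictionaries. The field names (overall_status, phase, enrollment), the permitted values, and the helper names PropertyRule and validate are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PropertyRule:
    """Loose analogue of a SHACL property shape (hypothetical structure)."""
    path: str                              # metadata field the rule applies to
    min_count: int = 0                     # analogue of sh:minCount
    max_count: Optional[int] = None        # analogue of sh:maxCount
    allowed: Optional[frozenset] = None    # analogue of sh:in
    datatype: Optional[type] = None        # analogue of sh:datatype


# Illustrative rule instances; field names and value sets are assumptions.
RULES = [
    PropertyRule("overall_status", min_count=1, max_count=1,
                 allowed=frozenset({"Recruiting", "Completed", "Terminated"})),
    PropertyRule("phase", max_count=1,
                 allowed=frozenset({"Phase 1", "Phase 2", "Phase 3", "Phase 4"})),
    PropertyRule("enrollment", min_count=1, max_count=1, datatype=int),
]


def validate(record, rules=RULES):
    """Return a list of (field, violated constraint) pairs for one study record."""
    violations = []
    for rule in rules:
        raw = record.get(rule.path)
        if raw is None:
            values = []
        elif isinstance(raw, list):
            values = raw
        else:
            values = [raw]
        # Empty strings are treated as missing values.
        values = [v for v in values if v != ""]

        # Cardinality checks.
        if len(values) < rule.min_count:
            violations.append((rule.path, "minCount"))
        if rule.max_count is not None and len(values) > rule.max_count:
            violations.append((rule.path, "maxCount"))

        # Value type and value set checks, one per value.
        for v in values:
            if rule.datatype is not None and not isinstance(v, rule.datatype):
                violations.append((rule.path, "datatype"))
            elif rule.allowed is not None and v not in rule.allowed:
                violations.append((rule.path, "in"))
    return violations
```

Calling validate on each study record and collecting the non-empty results would yield a per-study error list. A real SHACL-based setup would instead express the rules as shapes graphs and evaluate them against RDF data.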

Keywords: Big data; Clinical studies; ClinicalTrials.gov; Data quality; Metadata.

MeSH terms

  • Databases, Factual
  • Language*
  • Metadata*
  • Registries
  • Semantics