Using logical constraints to validate statistical information about disease outbreaks in collaborative knowledge graphs: the case of COVID-19 epidemiology in Wikidata

PeerJ Comput Sci. 2022 Sep 29:8:e1085. doi: 10.7717/peerj-cs.1085. eCollection 2022.

Abstract

Urgent global research demands real-time dissemination of precise data. Wikidata, a collaborative and openly licensed knowledge graph available in RDF format, provides an ideal forum for exchanging structured data that can be verified and consolidated using validation schemas and bot edits. In this research article, we catalog a set of automatable tasks needed to assess and validate the portion of Wikidata relating to COVID-19 epidemiology. These tasks assess statistical data and are implemented in SPARQL, a query language for semantic databases. We demonstrate the efficiency of our methods for evaluating structured non-relational information on COVID-19 in Wikidata, and their applicability to collaborative ontologies and knowledge graphs more broadly. We show the advantages and limitations of our proposed approach by comparing it with the features of other methods for validating linked web data, as reported in previous research.
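To make the idea of a logical constraint on outbreak statistics concrete, the following is a minimal sketch in Python, not the SPARQL queries the article describes. It assumes a simplified record with hypothetical field names (`cases`, `deaths`, `recoveries`, `case_fatality_rate`) and checks relations that must hold between them, such as deaths not exceeding cases. The specific rules and the tolerance value are illustrative assumptions, not taken from the article.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OutbreakRecord:
    """One dated snapshot of outbreak statistics (hypothetical field names)."""
    region: str
    cases: int
    deaths: int
    recoveries: int
    # Stated rate as a fraction (0.02 means 2%); None if not reported.
    case_fatality_rate: Optional[float] = None


def check_record(rec: OutbreakRecord, tolerance: float = 0.001) -> List[str]:
    """Return a description of each logical constraint the record violates."""
    violations = []
    # Counts can never be negative.
    if min(rec.cases, rec.deaths, rec.recoveries) < 0:
        violations.append("negative count")
    # Deaths are a subset of cases.
    if rec.deaths > rec.cases:
        violations.append("deaths exceed cases")
    # Deaths and recoveries are disjoint subsets of cases.
    if rec.deaths + rec.recoveries > rec.cases:
        violations.append("deaths plus recoveries exceed cases")
    # A stated case fatality rate should match deaths / cases.
    if rec.case_fatality_rate is not None and rec.cases > 0:
        expected = rec.deaths / rec.cases
        if abs(rec.case_fatality_rate - expected) > tolerance:
            violations.append("case fatality rate inconsistent with counts")
    return violations


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    consistent = OutbreakRecord("Region A", 1000, 20, 900, 0.02)
    inconsistent = OutbreakRecord("Region B", 100, 150, 10, 0.5)
    for rec in (consistent, inconsistent):
        print(rec.region, check_record(rec))
```

A check of this kind flags records for human or bot review rather than correcting them, since a violation alone does not say which of the conflicting values is wrong.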

Keywords: COVID-19 epidemiology; Collaborative curation; Data quality; Knowledge graph refinement; Public Health Emergency of International Concern; Public health surveillance; SPARQL; Shape Expressions; Validation constraints; Wikidata.

Grants and funding

The work done by Houcemeddine Turki, Mohamed Ali Hadj Taieb, and Mohamed Ben Aouicha was supported by the Ministry of Higher Education and Scientific Research in Tunisia (MoHESR) in the framework of Federated Research Project PRFCOV19-D1-P1, by the Wikimedia Foundation through a rapid grant, and by the WikiCred Grants Initiative of Craig Newmark Philanthropies, Facebook, and Microsoft. The work done by Jose Emilio Labra Gayo was funded by the Spanish Ministry of Economy and Competitiveness (Society challenges: TIN2017-88877-R). The work done by Daniel Mietchen was supported by the Alfred P. Sloan Foundation under grant numbers G-2019-11458 and G-2021-17106. The work done by Dariusz Jemielniak was funded by the Polish National Science Center (Grant No. 2019/35/B/HS6/01056). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.