MOLGENIS/connect: a system for semi-automatic integration of heterogeneous phenotype data with applications in biobanks

Chao Pang; David van Enckevort; Mark de Haan; Fleur Kelpin; Jonathan Jetten; Dennis Hendriksen; Tommy de Boer; Bart Charbon; Erwin Winder; K Joeri van der Velde; Dany Doiron; Isabel Fortier; Hans Hillege; Morris A Swertz

doi:10.1093/bioinformatics/btw155

Bioinformatics. 2016 Jul 15;32(14):2176-83. doi: 10.1093/bioinformatics/btw155. Epub 2016 Mar 21.

Affiliations

¹ Department of Genetics, University Medical Center Groningen, Genomics Coordination Center, University of Groningen, Groningen, The Netherlands; Department of Epidemiology, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands.
² Department of Genetics, University Medical Center Groningen, Genomics Coordination Center, University of Groningen, Groningen, The Netherlands.
³ Research Institute of the McGill University Health Centre and Department of Medicine, McGill University, Montreal, Canada.
⁴ Department of Epidemiology, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands.

Abstract

Motivation: While the size and number of biobanks, patient registries and other data collections are increasing, biomedical researchers still often need to pool data for statistical power, a task that requires time-intensive retrospective integration.

Results: To address this challenge, we developed MOLGENIS/connect, a semi-automatic system to find, match and pool data from different sources. The system shortlists relevant source attributes from thousands of candidates using ontology-based query expansion to overcome variations in terminology. It then generates algorithms that transform source attributes to a common target DataSchema. These include unit conversion, categorical value matching and complex conversion patterns (e.g. calculation of BMI). Compared with human experts, MOLGENIS/connect auto-generated 27% of the algorithms perfectly, and an additional 46% needed only minor editing, reducing the human effort and expertise needed to pool data.
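The two steps described above (shortlisting by query expansion, then transforming values) can be illustrated with a minimal Python sketch. It is not the MOLGENIS/connect implementation: the synonym table stands in for a real ontology lookup, and every name in it (SYNONYMS, expand_query, shortlist, bmi, SMOKING_MAP, match_category), the word-overlap scoring and the category codes are assumptions made for illustration only.

```python
# Hypothetical synonym table standing in for an ontology lookup (assumption).
SYNONYMS = {
    "weight": {"weight", "body mass", "body weight"},
    "height": {"height", "stature", "body length"},
}


def expand_query(label):
    """Expand a target attribute label into a set of search terms."""
    terms = {label.lower()}
    for word in label.lower().split():
        terms |= SYNONYMS.get(word, set())
    return terms


def shortlist(target_label, source_attributes, top_n=3):
    """Rank source attributes by word overlap with the best expanded term."""
    terms = expand_query(target_label)
    scored = []
    for name, description in source_attributes.items():
        words = set(description.lower().split())
        # Fraction of a term's words found in the description, best term wins.
        score = max(len(words & set(t.split())) / len(t.split()) for t in terms)
        if score > 0:
            scored.append((score, name))
    # Highest score first; ties broken alphabetically by attribute name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_n]]


def bmi(weight_kg, height_cm):
    """Complex conversion pattern: BMI, with a cm-to-m unit conversion."""
    height_m = height_cm / 100.0
    return weight_kg / height_m ** 2


# Hypothetical target codes for a categorical attribute (assumption).
SMOKING_MAP = {"never": 0, "former": 1, "current": 2}


def match_category(value, mapping=SMOKING_MAP):
    """Categorical value matching; returns None when no target code applies."""
    return mapping.get(value.strip().lower())
```

In this sketch, a source attribute described as "Body mass in kg" is shortlisted for the target "Weight" even though the two share no word, which is the kind of terminology gap query expansion is meant to bridge. Unmatched categories return None so that a human curator can resolve them, in keeping with the semi-automatic workflow.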

Availability and implementation: Source code, binaries and documentation are available as open source under LGPLv3 from http://github.com/molgenis/molgenis and www.molgenis.org/connect

Contact: m.a.swertz@rug.nl

Supplementary information: Supplementary data are available at Bioinformatics online.

MeSH terms

Algorithms
Biological Ontologies
Biological Specimen Banks*
Computational Biology / methods*
Humans
Phenotype*
Software*