J Am Med Inform Assoc. 2015 Jan;22(1):65-75.
doi: 10.1136/amiajnl-2013-002577. Epub 2014 Oct 31.

BiobankConnect: Software to Rapidly Connect Data Elements for Pooled Analysis Across Biobanks Using Ontological and Lexical Indexing

Free PMC article

Chao Pang et al. J Am Med Inform Assoc. 2015.

Abstract

Objective: Pooling data across biobanks is necessary to increase statistical power, reveal more subtle associations, and synergize the value of data sources. However, searching for desired data elements among the thousands of available elements and harmonizing differences in terminology, data collection, and structure are arduous and time-consuming.

Materials and methods: To speed up biobank data pooling we developed BiobankConnect, a system to semi-automatically match desired data elements to available elements by: (1) annotating the desired elements with ontology terms using BioPortal; (2) automatically expanding the query for these elements with synonyms and subclass information using OntoCAT; (3) automatically searching available elements for these expanded terms using Lucene lexical matching; and (4) shortlisting relevant matches sorted by matching score.
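Steps (2) to (4) can be illustrated with a minimal Python sketch. It is not the BiobankConnect implementation: the synonym table is an invented stand-in for an ontology lookup (the source uses OntoCAT), the token-overlap score is a crude substitute for Lucene scoring, and all element names and labels are hypothetical.

```python
import re
from itertools import product

# Invented synonym table standing in for an ontology lookup (assumption, not OntoCAT output).
SYNONYMS = {
    "parental": ["parental", "father", "mother"],
    "diabetes mellitus": ["diabetes mellitus", "diabetes", "dm"],
}

def expand_query(annotations):
    """Step 2: build every combination of one synonym per annotated term."""
    options = [SYNONYMS.get(term, [term]) for term in annotations]
    return [" ".join(combo) for combo in product(*options)]

def tokens(text):
    """Lower-case a string and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def score(query, label):
    """Step 3: fraction of query tokens present in the label (stand-in for a Lucene score)."""
    query_tokens = tokens(query)
    return len(query_tokens & tokens(label)) / len(query_tokens)

def shortlist(annotations, available, top_n=10):
    """Step 4: keep elements with a non-zero best score, sorted by score, then name."""
    queries = expand_query(annotations)
    scored = []
    for name, label in available.items():
        best = max(score(query, label) for query in queries)
        if best > 0:
            scored.append((name, best))
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]

# Hypothetical available data elements (name -> label).
AVAILABLE = {
    "E1": "Does your father have diabetes?",
    "E2": "Mother diagnosed with diabetes mellitus",
    "E3": "Systolic blood pressure",
    "E4": "Diabetes medication use",
}

queries = expand_query(["parental", "diabetes mellitus"])
ranked = shortlist(["parental", "diabetes mellitus"], AVAILABLE)
for name, best in ranked:
    print(name, round(best, 2))
```

With three synonyms per term, the expansion yields 3 × 3 = 9 queries. Taking the best score over all expanded queries lets an element phrased as "father ... diabetes" match even though it shares only one word with the original target label.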

Results: We evaluated BiobankConnect using human-curated matches from EU-BioSHaRE, searching for 32 desired data elements in 7461 available elements from six biobanks. We found 0.75 precision at rank 1 and 0.74 recall at rank 10 compared to a manually curated set of relevant matches. In addition, best matches chosen by BioSHaRE experts ranked first in 63.0% and in the top 10 in 98.4% of cases, indicating that our system has the potential to significantly reduce manual matching work.
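For readers unfamiliar with rank-based metrics, the sketch below shows one common way to compute precision and recall at rank k for a single target element. It assumes the standard definitions (relevant items in the top k divided by k, or by the total number of relevant items); the abstract does not state how the authors averaged across elements and biobanks. The candidate list mirrors the situation described for Figure 4A, where the two gold standard elements sit in positions two and three; `OTHER_1` and `OTHER_2` are hypothetical placeholders.

```python
def precision_at_k(ranked, relevant, k):
    """Share of the top-k candidates that are relevant (standard definition, divides by k)."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k

def recall_at_k(ranked, relevant, k):
    """Share of all relevant elements that appear in the top-k candidates."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

# Candidate list for one target element; OTHER_* names are placeholders.
ranked = ["OTHER_1", "V57A_1", "V57B_1", "OTHER_2"]
relevant = {"V57A_1", "V57B_1"}

p_at_1 = precision_at_k(ranked, relevant, 1)
p_at_3 = precision_at_k(ranked, relevant, 3)
r_at_10 = recall_at_k(ranked, relevant, 10)
print(p_at_1, round(p_at_3, 2), r_at_10)
```

In this example the top-ranked candidate is not a gold standard match, so precision at rank 1 is zero, yet both matches fall within the top 10, so recall at rank 10 is complete.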

Conclusions: BiobankConnect provides an easy user interface to significantly speed up the biobank harmonization process. It may also prove useful for other forms of biomedical data integration. All the software can be downloaded as a MOLGENIS open source app from http://www.github.com/molgenis, with a demo available at http://www.biobankconnect.org.

Keywords: Biobank; Data integration; Harmonization; Search.

Figures

Figure 1: Harmonization process. Many studies need to pool data in order to reach sufficient statistical power; however, matching data elements of interest to the available data elements is a daunting task.
Figure 2: Example of query expansion. ‘Parental diabetes mellitus’ is annotated with the ontology terms ‘Parental’ and ‘Diabetes mellitus.’ Then the terms are expanded based on synonyms, resulting in three terms for ‘Diabetes mellitus’ and three terms for ‘Parental,’ so all 3 × 3 = 9 combinations are used for the search (only four are shown here).
Figure 3: Overview of BiobankConnect. Data elements of interest (target) are matched to all available data elements (source), based on knowledge from the ontology terms.
Figure 4: Matching results produced by BiobankConnect. (A) Matching data elements for ‘Parental diabetes mellitus’ in Prevend. The gold standard matches are two data elements, V57A_1 and V57B_1, located in the second and third positions. (B) The matching data element for ‘History of hypertension’ in the NCDS database. The best match in the experts’ opinion is ‘downhibp,’ located in the first position on the candidate list. CM, cohort member.
Figure 5: Receiver operating characteristic (ROC) curve. Matching performance for 32 data elements in five different biobanks. Note that BiobankConnect only retrieves a subset of data elements based on the semantic/lexical similarity queries; therefore, the ROC curves end before reaching (1.00, 1.00). For the remaining data elements we simulated a line of non-discrimination, indicated by dotted lines.
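The truncated ROC curve described for Figure 5 can be sketched as follows. This is one plausible reading of the caption, not the authors' procedure: the retrieved candidates contribute ROC points in rank order, and the unretrieved remainder is treated as randomly ordered, which is drawn as a straight segment from the last observed point to (1.00, 1.00). The relevance flags and the totals are invented for illustration.

```python
def roc_points(flags, total_pos, total_neg):
    """ROC points (FPR, TPR) after each retrieved candidate, starting at the origin."""
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    for is_relevant in flags:
        if is_relevant:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos / total_neg, true_pos / total_pos))
    return points

def extend_to_corner(points):
    """Assumed treatment of unretrieved elements: a straight segment to (1, 1)."""
    if points[-1] != (1.0, 1.0):
        return points + [(1.0, 1.0)]
    return points

def auc(points):
    """Area under a piecewise-linear curve using the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Invented example: 4 retrieved candidates out of 20 elements, 4 of which are relevant.
flags = [True, False, True, False]
observed = roc_points(flags, total_pos=4, total_neg=16)
full_curve = extend_to_corner(observed)
print(observed[-1], full_curve[-1], auc(full_curve))
```

Here the observed curve stops at (0.125, 0.5) because only two of the four relevant elements were retrieved; the added segment corresponds to the dotted portion of the plot.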


