Database (Oxford). 2015 Sep 18;2015:bav089.
doi: 10.1093/database/bav089. Print 2015.

SORTA: A System for Ontology-Based Re-Coding and Technical Annotation of Biomedical Phenotype Data

Free PMC article

Chao Pang et al. Database (Oxford). 2015.

Abstract

There is an urgent need to standardize the semantics of biomedical data values, such as phenotypes, to enable comparative and integrative analyses. However, it is unlikely that all studies will use the same data collection protocols. As a result, retrospective standardization is often required, which involves matching original (unstructured or locally coded) data to widely used coding or ontology systems such as SNOMED CT (clinical terms), ICD-10 (International Classification of Diseases) and HPO (Human Phenotype Ontology). This data curation is usually a time-consuming task performed by a human expert. To help mechanize this process, we have developed SORTA, a computer-aided system for rapidly encoding free text or locally coded values to a formal coding system or ontology. SORTA matches original data values (uploaded in semicolon-delimited format) to a target coding system (uploaded as an Excel spreadsheet or in OWL (Web Ontology Language) or OBO (Open Biomedical Ontologies) format). It then semi-automatically shortlists candidate codes for each data value using Lucene and n-gram based matching algorithms, and can also learn from matches chosen by human experts. We evaluated SORTA's applicability in two use cases. For the LifeLines biobank, we used SORTA to recode 90 000 free text values (including 5211 unique values) about physical exercise to MET (Metabolic Equivalent of Task) codes. For the CINEAS clinical symptom coding system, we used SORTA to map to HPO, enriching HPO when necessary (315 terms matched so far). Out of the shortlists at rank 1, we found a precision/recall of 0.97/0.98 in LifeLines and of 0.58/0.45 in CINEAS. More importantly, users found the tool both a major time saver and a quality improvement because SORTA reduced the chances of human mistakes. Thus, SORTA can dramatically ease data (re)coding tasks and we believe it will prove useful for many more projects.
Database URL: http://molgenis.org/sorta or as an open source download from http://www.molgenis.org/wiki/SORTA.
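
The abstract does not spell out the n-gram scoring formula, so the following is only a minimal sketch of how an n-gram based shortlist could work, not SORTA's actual implementation. It assumes character bigrams compared with a Dice coefficient and expressed as a percentage; the function names and the toy codes (`A1`, `A2`, `A3`) are hypothetical and are not real MET codes.

```python
from collections import Counter


def bigrams(text):
    """Return the multiset of character bigrams of a normalized string.

    Assumption: lowercase, collapsed whitespace, and boundary markers
    (^ and $) so that word starts and ends also contribute bigrams.
    """
    normalized = " ".join(text.lower().split())
    padded = f"^{normalized}$"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def ngram_similarity(a, b):
    """Dice coefficient over character bigrams, as a percentage (0-100)."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    shared = sum((grams_a & grams_b).values())  # multiset intersection
    total = sum(grams_a.values()) + sum(grams_b.values())
    return 100.0 * 2 * shared / total


def shortlist(value, codes, top_k=3):
    """Rank candidate codes for one input value, best match first.

    `codes` maps a code identifier to its label (hypothetical structure).
    """
    scored = [
        (ngram_similarity(value, label), code, label)
        for code, label in codes.items()
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Toy coding system with made-up identifiers and Dutch activity labels.
    toy_codes = {"A1": "zwemmen", "A2": "fietsen", "A3": "wandelen"}
    for score, code, label in shortlist("zwemen", toy_codes):
        print(f"{score:5.1f}%  {code}  {label}")
```

In this sketch a misspelled input such as "zwemen" still ranks the "zwemmen" label first, because most of its bigrams survive the typo. A human expert would then confirm or reject the top candidate, which matches the semi-automatic workflow described in the abstract.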

Figures

Figure 1.
SORTA overview. The desired coding system or ontology can be uploaded in OWL/OBO or Excel format and is indexed for fast matching searches. Data values can be uploaded and then automatically matched with the indexed ontology using Lucene. A list of the most relevant concepts is retrieved from the index and matching percentages are calculated using the n-gram algorithm so that users can easily evaluate the matching score. Users can choose the mappings from the suggested list.
Figure 2.
Example of coding a physical activity. A list of MET codes was matched with the input and sorted based on similarity scores, from which the proper code can be selected to recode the input. If none of the candidate codes is suitable, users can either search for codes manually or decide to use ‘Unknown code’. If the button ‘Code data’ is clicked, the input is recoded only with the selected code. If the button ‘Code and add’ is clicked, the input is recoded and also added to the code as a new synonym. The example input is a misspelling of the Dutch word for ‘swimming’. zwemmen = swimming, zwemmen 2x = swimming twice a week, soms zwemmen = occasional swimming, gym-zwemmen = water gym.
Figure 3.
Receiver operating characteristic (ROC) curves evaluating performance on LifeLines data. Blue represents the performance before the researcher recoded all the LifeLines data. During coding, the researcher introduced new knowledge to the database and if a similar dataset was uploaded again (e.g. second rounds of the same questionnaire), the coding performance greatly improved as shown by the red curve.
Figure 4.
Example of matching the input value ‘external auditory canal defect’ with HPO ontology terms. A list of candidate HPO ontology terms was retrieved from the index and sorted based on similarity scores. Users can select a mapping by clicking the ‘v’ button. If none of the candidate mappings are suitable, users can choose the ‘No match’ option.
Figure 5.
Performance comparison for matching HPO terms among three algorithms. Lucene (blue line), combination of Lucene + n-gram (red) and combination of Lucene + n-gram + inverse document frequency (green).
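
The evaluation figures and the rank-1 precision/recall values in the abstract can be made concrete with a small sketch. The exact definitions used in the paper are not given here, so this example makes explicit assumptions: precision is the fraction of values with at least one suggestion whose correct code appears within the top `rank` candidates, and recall is the fraction of all values with a known correct code that are matched within that rank. The function name and the toy data are hypothetical.

```python
def precision_recall_at_rank(shortlists, gold, rank=1):
    """Compute precision and recall of shortlists cut off at `rank`.

    shortlists: dict mapping an input value to its ordered candidate codes
                (an empty list means the tool made no suggestion).
    gold:       dict mapping an input value to the expert-chosen code.
    """
    suggested = 0  # values for which at least one candidate was offered
    correct = 0    # values whose gold code is within the top `rank`
    for value, candidates in shortlists.items():
        top = candidates[:rank]
        if not top:
            continue
        suggested += 1
        if gold.get(value) in top:
            correct += 1
    precision = correct / suggested if suggested else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up values and codes, purely for illustration.
    toy_shortlists = {"a": ["X", "Y"], "b": ["Z"], "c": []}
    toy_gold = {"a": "X", "b": "Q", "c": "W"}
    print(precision_recall_at_rank(toy_shortlists, toy_gold, rank=1))
```

Sweeping `rank` (or a similarity cut-off) and recomputing these values is one way to trace curves like those compared across algorithms in the figures.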
