Broad-coverage biomedical relation extraction with SemRep

Halil Kilicoglu; Graciela Rosemblat; Marcelo Fiszman; Dongwook Shin

doi:10.1186/s12859-020-3517-7

BMC Bioinformatics. 2020 May 14;21(1):188. doi: 10.1186/s12859-020-3517-7.

Authors

Halil Kilicoglu¹ ², Graciela Rosemblat³, Marcelo Fiszman⁴, Dongwook Shin³

Affiliations

¹ Lister Hill National Center for Biomedical Communications, National Library of Medicine, 8600 Rockville Pike, Bethesda, 20894, MD, USA. halil@illinois.edu.
² University of Illinois at Urbana-Champaign, School of Information Sciences, 501 E Daniel Street, Champaign, 61820, IL, USA. halil@illinois.edu.
³ Lister Hill National Center for Biomedical Communications, National Library of Medicine, 8600 Rockville Pike, Bethesda, 20894, MD, USA.
⁴ Independent Researcher, Rio de Janeiro, Brazil.

Abstract

Background: In the era of information overload, natural language processing (NLP) techniques are increasingly needed to support advanced biomedical information management and discovery applications. In this paper, we present an in-depth description of SemRep, an NLP system that extracts semantic relations from PubMed abstracts using linguistic principles and UMLS domain knowledge. We also evaluate SemRep on two datasets. In one evaluation, we use a manually annotated test collection and perform a comprehensive error analysis. In another evaluation, we assess SemRep's performance on the CDR dataset, a standard benchmark corpus annotated with causal chemical-disease relationships.

Results: A strict evaluation of SemRep on our manually annotated dataset yields 0.55 precision, 0.34 recall, and 0.42 F₁ score. A relaxed evaluation, which more accurately characterizes SemRep performance, yields 0.69 precision, 0.42 recall, and 0.52 F₁ score. An error analysis reveals named entity recognition/normalization as the largest source of errors (26.9%), followed by argument identification (14%) and trigger detection errors (12.5%). The evaluation on the CDR corpus yields 0.90 precision, 0.24 recall, and 0.38 F₁ score. The recall and the F₁ score increase to 0.35 and 0.50, respectively, when the evaluation on this corpus is limited to sentence-bound relationships, which represents a fairer evaluation, as SemRep operates at the sentence level.
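
For readers who want to check how the reported scores relate to one another, the sketch below recomputes F₁ as the harmonic mean of precision and recall for the three evaluations where the abstract states both. The function name `f1_score` and the `reported` dictionary are illustrative choices and are not taken from SemRep or its evaluation scripts. The precision and recall values are the rounded figures quoted above.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as stated in the Results paragraph.
# The labels are hypothetical names chosen for this illustration.
reported = {
    "strict": (0.55, 0.34, 0.42),
    "relaxed": (0.69, 0.42, 0.52),
    "CDR": (0.90, 0.24, 0.38),
}

for name, (p, r, f1) in reported.items():
    computed = f1_score(p, r)
    print(f"{name}: computed F1 = {computed:.2f}, reported F1 = {f1:.2f}")
```

In each case the value recomputed from the rounded precision and recall agrees with the reported F₁ to two decimal places.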

Conclusions: SemRep is a broad-coverage, interpretable, strong baseline system for extracting semantic relations from biomedical text. It also underpins SemMedDB, a literature-scale knowledge graph based on semantic relations. Through SemMedDB, SemRep has had significant impact in the scientific community, supporting a variety of clinical and translational applications, including clinical decision making, medical diagnosis, drug repurposing, literature-based discovery and hypothesis generation, and contributing to improved health outcomes. In ongoing development, we are redesigning SemRep to increase its modularity and flexibility, and addressing weaknesses identified in the error analysis.

Keywords: Biomedical relation extraction; Natural language processing; Scientific publications; Semantic interpretation.

MeSH terms

Algorithms*
Humans
Information Storage and Retrieval*
Natural Language Processing
PubMed
Semantics*
Unified Medical Language System

Grants and funding

Intramural/U.S. National Library of Medicine