Coreference analysis in clinical notes: a multi-pass sieve with alternate anaphora resolution modules

J Am Med Inform Assoc. 2012 Sep-Oct;19(5):867-74. doi: 10.1136/amiajnl-2011-000766. Epub 2012 Jun 16.

Abstract

Objective: This paper describes the coreference resolution system submitted by Mayo Clinic for the 2011 i2b2/VA/Cincinnati shared task Track 1C. The goal of the task was to construct a system that links the markables corresponding to the same entity.

Materials and methods: The task organizers provided progress notes and discharge summaries that were annotated with the markables of treatment, problem, test, person, and pronoun. We used a multi-pass sieve algorithm that applies deterministic rules in the order of preciseness and simultaneously gathers information about the entities in the documents. Our system, MedCoref, also uses a state-of-the-art machine learning framework as an alternative to the final, rule-based pronoun resolution sieve.

Results: The best system that uses a multi-pass sieve has an overall score of 0.836 (average of B(3), MUC, Blanc, and CEAF F score) for the training set and 0.843 for the test set.

Discussion: A supervised machine learning system that typically uses a single function to find coreferents cannot accommodate irregularities encountered in data especially given the insufficient number of examples. On the other hand, a completely deterministic system could lead to a decrease in recall (sensitivity) when the rules are not exhaustive. The sieve-based framework allows one to combine reliable machine learning components with rules designed by experts.

Conclusion: Using relatively simple rules, part-of-speech information, and semantic type properties, an effective coreference resolution system could be designed. The source code of the system described is available at https://sourceforge.net/projects/ohnlp/files/MedCoref.

Publication types

  • Evaluation Study
  • Multicenter Study
  • Research Support, N.I.H., Extramural
  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Algorithms
  • Artificial Intelligence*
  • Data Mining / methods*
  • Decision Support Systems, Clinical*
  • Electronic Health Records*
  • Humans
  • Markov Chains
  • Natural Language Processing*
  • Semantics
  • Sensitivity and Specificity
  • United States