Facilitating clinical research through automation: Combining optical character recognition with natural language processing

Clin Trials. 2022 Oct;19(5):504-511. doi: 10.1177/17407745221093621. Epub 2022 May 24.


Background/aims: Performance status is crucial for most clinical research, whether as an eligibility criterion, a comorbidity covariate, or a trial endpoint. Yet performance status is often embedded as free text within a patient's electronic medical record rather than coded directly, which makes it extremely difficult to extract for research. Furthermore, performance status information frequently resides in outside reports, which are scanned into the electronic medical record along with thousands of clinic notes. The image format of scanned documents is also a major obstacle to search and retrieval, because natural language processing cannot be applied to unstructured text within an image. We therefore used optical character recognition software to convert images to a searchable format, allowing natural language processing to identify pertinent performance status data elements within scanned electronic medical records.

Methods: Our study cohort consisted of 189 subjects diagnosed with diffuse large B-cell lymphoma for whom performance status was a required data element for analysis of prognostic factors related to recurrence and survival. A clinical subject matter expert had previously abstracted performance status manually, and that abstraction served as the gold standard. From our data warehouse, we extracted relevant scanned electronic medical record documents and applied optical character recognition to these images using the ABBYY FineReader software. The Linguamatics i2e natural language processing software was then used to run queries for performance status against the corpus of electronic medical record documents. We evaluated our optical character recognition/natural language processing pipeline for accuracy and for reduction in data extraction effort.
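
To illustrate the kind of query run in the second stage, the following is a minimal sketch in Python, not the Linguamatics i2e queries used in the study. It assumes the scanned page has already been converted to plain text by optical character recognition, and it substitutes simple regular expressions for a full natural language processing engine. The pattern names, the function `find_performance_status`, and the sample note are hypothetical; real clinical notes vary far more than these patterns cover.

```python
import re

# Hypothetical patterns (assumptions, not the study's queries).
# ECOG scores run 0-5; Karnofsky scores run 10-100 in steps of 10.
ECOG_RE = re.compile(
    r"\b(?:ECOG|Eastern Cooperative Oncology Group)\b[^0-9\n]{0,40}?([0-5])\b",
    re.IGNORECASE,
)
KPS_RE = re.compile(
    r"\b(?:KPS|Karnofsky)\b[^0-9\n]{0,40}?(100|[1-9]0)\s*%?",
    re.IGNORECASE,
)


def find_performance_status(text):
    """Return (scale, score, matched_snippet) tuples found in OCR'd text."""
    hits = []
    for m in ECOG_RE.finditer(text):
        hits.append(("ECOG", int(m.group(1)), m.group(0)))
    for m in KPS_RE.finditer(text):
        hits.append(("KPS", int(m.group(1)), m.group(0)))
    return hits


# Hypothetical OCR output for one scanned page.
sample = "Patient seen today. ECOG PS 2. Karnofsky performance status 70%."
for scale, score, snippet in find_performance_status(sample):
    print(scale, score, repr(snippet))
```

Returning the matched snippet alongside the score lets a reviewer check each hit in context instead of rereading the whole chart, which is the focused review the pipeline is meant to support.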

Results: Applying our optical character recognition/natural language processing pipeline yielded high accuracy and reduced the time needed to extract performance status data. The transformed scanned documents from a random sample of patients yielded excellent precision, recall, and F score, with <1% incorrect results. In a second cohort, the median time to review documents for patients with performance status data present was reduced by a third. The largest time savings came in the review of documents that did not contain performance status information: a median of 18 minutes versus 108 minutes for manual review, an 83% reduction in data abstraction effort.
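
For readers less familiar with these measures, the sketch below shows how precision, recall, and F score are conventionally computed from counts of true positives, false positives, and false negatives, and how the reported 83% reduction follows from the two median review times. The function names are illustrative assumptions. The abstract does not report the underlying counts, so none are shown here.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions; returns 0.0 where a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def percent_reduction(baseline, new):
    """Relative reduction from baseline to new, as a percentage."""
    return 100.0 * (baseline - new) / baseline


# Median review times from the abstract: 108 minutes manual, 18 with the pipeline.
print(round(percent_reduction(108, 18)))  # 83
```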

Conclusion: By applying this optical character recognition/natural language processing pipeline, we achieved significant operational improvement and reduced time for information retrieval to support clinical research. Our study demonstrated that optical character recognition software provides an effective mechanism for transforming scanned electronic medical record images so that natural language processing can be applied, yielding highly accurate data abstraction. We conclude that our optical character recognition/natural language processing pipeline can greatly facilitate research data abstraction by providing a highly focused data review, eliminating unnecessary manual review of the entire chart, and thus freeing time for abstracting other data elements that require more human interpretation.

Keywords: Eastern Cooperative Oncology Group; Karnofsky performance status; Scanned medical records; natural language processing; optical character recognition; performance status.

MeSH terms

  • Automation
  • Clinical Trials as Topic
  • Electronic Health Records
  • Humans
  • Information Storage and Retrieval*
  • Natural Language Processing*
  • Software