Multi-domain clinical natural language processing with MedCAT: The Medical Concept Annotation Toolkit

Zeljko Kraljevic; Thomas Searle; Anthony Shek; Lukasz Roguski; Kawsar Noor; Daniel Bean; Aurelie Mascio; Leilei Zhu; Amos A Folarin; Angus Roberts; Rebecca Bendayan; Mark P Richardson; Robert Stewart; Anoop D Shah; Wai Keong Wong; Zina Ibrahim; James T Teo; Richard J B Dobson

doi:10.1016/j.artmed.2021.102083

Artif Intell Med. 2021 Jul:117:102083. doi: 10.1016/j.artmed.2021.102083. Epub 2021 May 1.

Authors

Zeljko Kraljevic¹, Thomas Searle², Anthony Shek³, Lukasz Roguski⁴, Kawsar Noor⁴, Daniel Bean⁵, Aurelie Mascio², Leilei Zhu⁶, Amos A Folarin⁷, Angus Roberts⁸, Rebecca Bendayan², Mark P Richardson³, Robert Stewart⁹, Anoop D Shah⁴, Wai Keong Wong⁶, Zina Ibrahim¹, James T Teo¹⁰, Richard J B Dobson¹¹

Affiliations

¹ Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK.
² Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK; NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London, London, UK.
³ Department of Clinical Neuroscience, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK.
⁴ Health Data Research UK London, University College London, London, UK; Institute of Health Informatics, University College London, London, UK; NIHR BRC Clinical Research Informatics Unit, University College London Hospitals, NHS Foundation Trust, London, UK.
⁵ Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK; Health Data Research UK London, University College London, London, UK.
⁶ Institute of Health Informatics, University College London, London, UK; NIHR BRC Clinical Research Informatics Unit, University College London Hospitals, NHS Foundation Trust, London, UK.
⁷ Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK; Institute of Health Informatics, University College London, London, UK; NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London, London, UK.
⁸ Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK; Health Data Research UK London, University College London, London, UK; NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London, London, UK.
⁹ Department of Psychological Medicine, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK; NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London, London, UK.
¹⁰ Department of Clinical Neuroscience, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK; Department of Neurology, King's College Hospital NHS Foundation Trust, London, UK.
¹¹ Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK; Health Data Research UK London, University College London, London, UK; Institute of Health Informatics, University College London, London, UK; NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London, London, UK. Electronic address: richard.j.dobson@kcl.ac.uk.

PMID: 34127232
DOI: 10.1016/j.artmed.2021.102083

Abstract

Electronic health records (EHR) contain large volumes of unstructured text, requiring the application of information extraction (IE) technologies to enable clinical analysis. We present the open-source Medical Concept Annotation Toolkit (MedCAT) that provides: (a) a novel self-supervised machine learning algorithm for extracting concepts using any concept vocabulary including UMLS/SNOMED-CT; (b) a feature-rich annotation interface for customizing and training IE models; and (c) integrations with the broader CogStack ecosystem for vendor-agnostic health system deployment. We show improved performance in extracting UMLS concepts from open datasets (F1: 0.448-0.738 vs 0.429-0.650). Further real-world validation demonstrates SNOMED-CT extraction at 3 large London hospitals with self-supervised training over ∼8.8B words from ∼17M clinical records and further fine-tuning with ∼6K clinician-annotated examples. We show strong transferability (F1 > 0.94) between hospitals, datasets and concept types, indicating cross-domain EHR-agnostic utility for accelerated clinical and research use cases.
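
The F1 figures quoted above are the harmonic mean of precision and recall. The abstract does not state how annotations were matched, so the following is only a minimal sketch of the standard definition, assuming exact matching on document, character span, and concept identifier. The `Annotation` type, the `precision_recall_f1` helper, and the placeholder concept IDs are hypothetical and are not part of MedCAT.

```python
from typing import NamedTuple, Set, Tuple


class Annotation(NamedTuple):
    """One extracted concept mention (hypothetical structure, not MedCAT's)."""
    doc_id: str
    start: int        # character offset where the mention begins
    end: int          # character offset where the mention ends
    concept_id: str   # identifier from the concept vocabulary


def precision_recall_f1(
    gold: Set[Annotation], predicted: Set[Annotation]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact-match scoring."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data with placeholder concept IDs (not real UMLS or SNOMED-CT codes).
gold = {
    Annotation("doc1", 0, 8, "CONCEPT_A"),
    Annotation("doc1", 20, 28, "CONCEPT_B"),
    Annotation("doc2", 5, 12, "CONCEPT_C"),
    Annotation("doc2", 30, 41, "CONCEPT_D"),
}
predicted = {
    Annotation("doc1", 0, 8, "CONCEPT_A"),    # correct
    Annotation("doc1", 20, 28, "CONCEPT_B"),  # correct
    Annotation("doc2", 5, 12, "CONCEPT_X"),   # right span, wrong concept
}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With two of three predictions correct against four gold annotations, precision is 2/3, recall is 1/2, and F1 is 4/7. A relaxed (overlapping-span) matching rule would give different numbers; which rule the paper uses is not stated in the abstract.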

Keywords: Clinical concept embeddings; Clinical natural language processing; Clinical ontology embeddings; Electronic health record information extraction.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Electronic Health Records
Information Storage and Retrieval
Natural Language Processing*
Systematized Nomenclature of Medicine*
Unified Medical Language System
