Cardiology record multi-label classification using latent Dirichlet allocation
- PMID: 30195419
- DOI: 10.1016/j.cmpb.2018.07.002
Abstract
Background and objectives: Electronic health records (EHRs) convey vast and valuable knowledge about dynamically changing clinical practices. Indeed, clinical documentation entails the inspection of a massive number of records across hospitals and hospital sections. The goal of this study is to provide an efficient framework that helps clinicians explore EHRs and obtain alternative views of both patient segments and diseases, such as clustering and statistical information about the development of heart diseases (replacement of pacemakers, valve implantation, etc.) in co-occurrence with other diseases. The task is challenging because it involves lengthy health records and a high number of classes in a multi-label setting.
Methods: Latent Dirichlet allocation (LDA) is a statistical procedure that explains each document by a multinomial distribution over latent topics, and each topic by a distribution over related words. These distributions allow collections of texts to be represented in a continuous space, enabling distance-based associations between documents and revealing the underlying topics. The topic models were assessed by means of four divergence metrics. In addition, we applied LDA to the task of multi-label document classification of EHRs according to the International Classification of Diseases, 10th Revision, Clinical Modification (ICD-10). Each EHR was assigned 7 codes on average, out of 970 different codes corresponding to cardiology.
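As an illustration of how divergence metrics compare documents in topic space, the following is a minimal sketch in Python. The abstract does not name the four metrics used, so the three shown here (Kullback-Leibler, Jensen-Shannon, Hellinger) are assumptions chosen because they are common for probability distributions; the document-topic vectors are hypothetical values, not data from the study.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p || q) in nats.

    eps guards against division by zero and log(0) when q has zero entries.
    """
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence: symmetric, bounded above by log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def hellinger(p, q):
    """Hellinger distance: symmetric, always between 0 and 1."""
    total = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(total) / math.sqrt(2)

# Hypothetical document-topic distributions over 3 topics (assumed values).
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.6, 0.3, 0.1]   # similar topic mix to doc_a
doc_c = [0.1, 0.1, 0.8]   # dominated by a different topic

for name, metric in [("KL", kl_divergence),
                     ("JS", jensen_shannon),
                     ("Hellinger", hellinger)]:
    print(name, round(metric(doc_a, doc_b), 4), round(metric(doc_a, doc_c), 4))
```

Under each metric, doc_a should come out closer to doc_b than to doc_c, which is the kind of distance-based association between documents that the topic representation is meant to support.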
Results: First, the discriminative ability of topic models was assessed using dissimilarity metrics. Nevertheless, there was an open question regarding the interpretability of automatically discovered topics. To address this issue, we explored the connection between the latent topics and ICD-10. EHRs were represented by means of LDA and, next, supervised classifiers were inferred from those representations. Given the low-dimensional representation provided by LDA, the search was computationally efficient compared to symbolic approaches such as TF-IDF. The classifiers achieved an average AUC of 77.79. As a side contribution, with this work we released the software implemented in Python and R to both train and evaluate the models.
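To make the notion of an average AUC in a multi-label setting concrete, here is a minimal Python sketch. It assumes the average is a macro-average (one AUC per code, then an unweighted mean), which the abstract does not specify; the label matrix and scores are hypothetical, and the function names `roc_auc` and `macro_auc` are introduced only for illustration.

```python
def roc_auc(y_true, scores):
    """AUC for one label: probability that a random positive document
    is scored above a random negative one (ties count as 0.5).

    Returns None when the label has no positives or no negatives,
    because AUC is undefined in that case.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_auc(Y_true, Y_score):
    """Unweighted mean of per-label AUCs, skipping undefined labels."""
    n_labels = len(Y_true[0])
    aucs = []
    for j in range(n_labels):
        auc = roc_auc([row[j] for row in Y_true],
                      [row[j] for row in Y_score])
        if auc is not None:
            aucs.append(auc)
    return sum(aucs) / len(aucs)

# Hypothetical data: 4 records, 3 codes (rows = records, columns = codes).
Y_true = [[1, 0, 1],
          [0, 1, 0],
          [1, 1, 0],
          [0, 0, 1]]
# Hypothetical classifier scores for each (record, code) pair.
Y_score = [[0.9, 0.2, 0.70],
           [0.3, 0.8, 0.40],
           [0.6, 0.4, 0.65],
           [0.4, 0.1, 0.60]]

print(macro_auc(Y_true, Y_score))
```

In this toy example the first two codes are ranked perfectly and the third has one misordered pair out of four, so the per-code AUCs are 1.0, 1.0 and 0.75. With hundreds of codes and few positives per code, a per-code average like this keeps rare codes from being swamped by frequent ones.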
Conclusions: Topic modeling offers a means of representing EHRs in a low-dimensional continuous space. This representation conveys relevant information as hidden topics in a comprehensive manner. Moreover, in practice, this compact representation made it possible to extract the ICD-10 codes associated with EHRs.
Keywords: Electronic health records; ICD-10 classification; Latent Dirichlet Allocation.
Copyright © 2018 Elsevier B.V. All rights reserved.