A clinical text classification paradigm using weak supervision and deep representation
- PMID: 30616584
- PMCID: PMC6322223
- DOI: 10.1186/s12911-018-0723-6
Abstract
Background: Automatic clinical text classification is a natural language processing (NLP) technology that unlocks information embedded in clinical narratives. Machine learning approaches have been shown to be effective for clinical text classification tasks. However, a successful machine learning model usually requires extensive human efforts to create labeled training data and conduct feature engineering. In this study, we propose a clinical text classification paradigm using weak supervision and deep representation to reduce these human efforts.
Methods: We develop a rule-based NLP algorithm to automatically generate labels for the training data, and then use pre-trained word embeddings as deep representation features for training machine learning models. Because the machine learning models are trained on labels generated by the automatic NLP algorithm, this training process is called weak supervision. We evaluate the paradigm's effectiveness on two institutional case studies at Mayo Clinic, smoking status classification and proximal femur (hip) fracture classification, and on one case study using a public dataset, the i2b2 2006 smoking status classification shared task. We test four widely used machine learning models with this paradigm: Support Vector Machine (SVM), Random Forest (RF), Multilayer Perceptron Neural Networks (MLPNN), and Convolutional Neural Networks (CNN). Precision, recall, and F1 score are used as metrics to evaluate performance.
Results: CNN achieves the best performance in both institutional tasks (F1 score: 0.92 for Mayo Clinic smoking status classification and 0.97 for fracture classification). We show that word embeddings significantly outperform tf-idf and topic modeling features in the paradigm, and that CNN captures additional patterns from the weak supervision compared to the rule-based NLP algorithms. We also observe two drawbacks of the proposed paradigm: CNN is more sensitive to the size of the training data, and the paradigm might not be effective for complex multiclass classification tasks.
Conclusion: By leveraging weak supervision and deep representation, the proposed clinical text classification paradigm could reduce the human effort of creating labeled training data and engineering features when applying machine learning to clinical text classification. The experiments validated the effectiveness of the paradigm on two institutional and one shared clinical text classification tasks.
Keywords: Clinical text classification; Electronic health records; Machine learning; Natural language processing; Weak supervision.
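The paradigm described in the Methods can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes two hypothetical keyword rules in place of the rule-based NLP algorithm, a tiny hand-written two-dimensional embedding table in place of pre-trained word embeddings, and a nearest-centroid classifier swapped in for the SVM, RF, MLPNN, and CNN models named in the abstract. All names, rules, vectors, and example notes below are invented for illustration.

```python
import re
from typing import Dict, List, Sequence, Tuple

# Hypothetical rules standing in for the rule-based NLP algorithm.
SMOKER_RULE = re.compile(r"\b(smokes|smoker|tobacco use)\b", re.IGNORECASE)
NEGATION_RULE = re.compile(r"\b(denies|never|non-?smoker)\b", re.IGNORECASE)

# Hypothetical 2-d vectors standing in for pre-trained word embeddings.
EMBEDDINGS: Dict[str, Tuple[float, float]] = {
    "smokes": (1.0, 0.0), "smoker": (0.9, 0.1),
    "cigarettes": (0.8, 0.0), "tobacco": (0.8, 0.1),
    "denies": (0.0, 1.0), "never": (0.0, 0.9),
}


def weak_label(note: str) -> int:
    """Rule-generated (noisy) label: 1 = smoker, 0 = non-smoker."""
    if NEGATION_RULE.search(note):
        return 0
    return 1 if SMOKER_RULE.search(note) else 0


def embed(note: str) -> List[float]:
    """Average the vectors of known words; zero vector if none are known."""
    tokens = re.findall(r"[a-z]+", note.lower())
    vecs = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vecs:
        return [0.0, 0.0]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(2)]


def train_centroids(notes: Sequence[str]) -> Dict[int, List[float]]:
    """Weak supervision: fit class centroids on rule-generated labels only."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for note in notes:
        label, vec = weak_label(note), embed(note)
        acc = sums.setdefault(label, [0.0, 0.0])
        sums[label] = [a + b for a, b in zip(acc, vec)]
        counts[label] = counts.get(label, 0) + 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict(centroids: Dict[int, List[float]], note: str) -> int:
    """Assign the class whose centroid is closest in embedding space."""
    vec = embed(note)
    return min(
        centroids,
        key=lambda y: sum((a - b) ** 2 for a, b in zip(vec, centroids[y])),
    )


def precision_recall_f1(gold: Sequence[int], pred: Sequence[int]):
    """Precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented training notes; no human-annotated labels are used for training.
TRAIN_NOTES = [
    "Patient smokes cigarettes daily.",
    "Current smoker, tobacco use for ten years.",
    "Patient denies tobacco use.",
    "Never smoker.",
]
centroids = train_centroids(TRAIN_NOTES)
```

In this toy setting, a note such as "Two packs of cigarettes daily." matches neither hypothetical rule, yet `predict(centroids, ...)` places it in the smoker class because "cigarettes" lies close to the smoker centroid. This only illustrates how a model trained on weak labels can pick up patterns the rules miss; it says nothing about the performance reported in the paper.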
Conflict of interest statement
Ethics approval and consent to participate
This study was a retrospective study of existing records. The study and a waiver of informed consent were approved by Mayo Clinic Institutional Review Board in accordance with 45 CFR 46.116 (Approval #17–003030).
Consent for publication
Not applicable; the manuscript does not contain individual-level data.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
