Background: Accurate classification of operative notes is essential for surgical outcomes research; however, Current Procedural Terminology (CPT) codes are notoriously nonspecific for many procedures. In such situations, the operative note (or "dictation") must be reviewed manually, a process that is labor-intensive and unsustainable. Natural language processing has shown considerable potential for improving the efficiency and accuracy of procedure classification from unstructured operative notes. Whether natural language processing can reliably differentiate between complex, multicomponent procedures, such as those involved in the care of cleft lip or palate and craniofacial anomalies, remains unexplored.
Objective: This study aims to develop and evaluate a machine learning framework for the automated classification of operative notes for cleft and craniofacial procedures.
Methods: This retrospective observational study used operative notes from patients undergoing cleft and craniofacial procedures at a single academic medical center from 2016 to 2024. Each note in the database had previously been classified manually. Notes were preprocessed and vectorized using term frequency-inverse document frequency. A One-vs-Rest classification framework with random forest as the base classifier was developed to categorize procedures at 3 levels: primary procedure type (cleft lip repair, alveolar bone grafting, cleft palate repair, velopharyngeal insufficiency correction, rhinoplasty, and other), procedural subtype (primary vs revision), and specific surgical technique used (eg, Fisher, Mulliken, or rotation-advancement technique for cleft lip repair). The classifier at each hierarchical level was developed and evaluated using cross-validation. To improve procedural subtype classification for classes with few samples, synthetic notes were added to the dataset. Area under the receiver operating characteristic curve (AUC), area under the precision-recall curve, micro- and macro-averaged F1-scores, and Hamming loss were used to assess model performance.
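As an illustration of the modeling approach described in the Methods, the following Python sketch combines term frequency-inverse document frequency vectorization with a One-vs-Rest random forest. It is a minimal sketch under stated assumptions, not the study's implementation: it assumes scikit-learn is available, the example notes and label names are invented, the hyperparameters are arbitrary, and the multilabel target format is an assumption suggested by the use of Hamming loss.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Invented snippets standing in for operative notes (not study data).
notes = [
    "unilateral cleft lip repair with rotation advancement flap",
    "cleft lip repair using fisher anatomic subunit technique",
    "cleft palate repair with furlow double opposing z-plasty",
    "cleft palate repair two flap palatoplasty",
    "alveolar bone graft harvested from iliac crest",
    "cleft lip repair with primary rhinoplasty",
]
# Hypothetical label names; one note may carry more than one procedure type.
labels = [
    ["cleft_lip"],
    ["cleft_lip"],
    ["cleft_palate"],
    ["cleft_palate"],
    ["abg"],
    ["cleft_lip", "rhinoplasty"],
]

# Convert label lists into a binary indicator matrix (one column per class).
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

# TF-IDF features feed one random forest per class (One-vs-Rest).
model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(RandomForestClassifier(n_estimators=50, random_state=0)),
)
model.fit(notes, Y)

# Binary predictions and per-class scores, one column per class.
pred = model.predict(notes)
proba = model.predict_proba(notes)
print(mlb.classes_)
print(pred)
```

In practice, predictions would be made on held-out folds (for example, with cross-validation) rather than on the training notes, and a separate classifier of this kind could be fitted for each hierarchical level.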
Results: The dataset comprised 630 operative notes from 311 pediatric patients undergoing cleft and craniofacial procedures between 2016 and 2024, with a mean age of 3.75 (range 0-19) years. The primary classification model achieved strong performance in distinguishing procedure types, with an AUC of 0.93 (SD 0.04), an area under the precision-recall curve of 0.84 (SD 0.05), a micro-averaged F1-score of 0.88 (SD 0.02), a macro-averaged F1-score of 0.84 (SD 0.03), and a Hamming loss of 0.04 (SD 0.01). Secondary classifiers achieved an AUC of 1.0 (SD 0.00) for cleft lip revision classification but failed to discriminate between primary and revision alveolar bone grafting procedures (AUC 0.49, SD 0.02). Tertiary classifiers for surgical technique identification achieved AUCs of 0.88 (SD 0.03), 0.89 (SD 0.03), and 0.89 (SD 0.09) for cleft lip, cleft palate, and velopharyngeal insufficiency repair techniques, respectively.
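To make the reported metrics concrete, the following dependency-free Python sketch computes Hamming loss and micro- and macro-averaged F1-scores on a small invented indicator matrix. The data and function names are illustrative assumptions, not the study's results or code. The example shows why the macro average can fall below the micro average when a sparsely represented class is classified poorly.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label assignments that are wrong."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for row_t, row_p in zip(y_true, y_pred)
        for t, p in zip(row_t, row_p)
    )
    return wrong / total


def label_counts(y_true, y_pred, j):
    """True positives, false positives, and false negatives for label j."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    n_labels = len(y_true[0])
    counts = [label_counts(y_true, y_pred, j) for j in range(n_labels)]
    # Micro: pool counts over all labels, then compute a single F1.
    micro = f1_from_counts(*(sum(c[i] for c in counts) for i in range(3)))
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1_from_counts(*c) for c in counts) / n_labels
    return micro, macro


# Invented example: 4 notes, 3 labels; the third label is never predicted.
y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [0, 1, 0], [1, 0, 0]]

hl = hamming_loss(y_true, y_pred)
micro, macro = micro_macro_f1(y_true, y_pred)
print(f"Hamming loss={hl:.2f}, micro-F1={micro:.2f}, macro-F1={macro:.2f}")
# prints: Hamming loss=0.25, micro-F1=0.67, macro-F1=0.56
```

Here 3 of 12 label assignments are wrong (Hamming loss 0.25), and the per-label F1-scores of 1, 2/3, and 0 average to a macro F1 of 5/9, below the pooled micro F1 of 2/3.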
Conclusions: This pilot study demonstrates that machine learning approaches can automate the classification of pediatric craniofacial operative notes across multiple levels of procedural detail. The implementation of such systems could substantially reduce the administrative burden associated with surgical research, operations, and quality improvement.
Keywords: cleft lip; cleft palate; craniofacial abnormalities; machine learning; natural language processing.
© Meredith Cox, Elaine Lin, Nicholas Oleck, Carlee Jones, Neill Y Li, Suhail K Mithani, Alexander C Allori. Originally published in JMIR Medical Informatics (https://medinform.jmir.org).