Development of an unsupervised machine learning algorithm for the prognostication of walking ability in spinal cord injury patients

Spine J. 2020 Feb;20(2):213-224. doi: 10.1016/j.spinee.2019.09.007. Epub 2019 Sep 13.


Background context: Traumatic spinal cord injury can have a dramatic effect on a patient's life. The degree of neurologic recovery greatly influences a patient's treatment and expected quality of life. This has resulted in the development of machine learning algorithms (MLA) that use acute demographic and neurologic information to prognosticate recovery. The van Middendorp et al. (2011) (vM) logistic regression (LR) model has been established as a reference model for the prediction of walking recovery following spinal cord injury, as it has been validated in many different countries. However, an examination of the way in which these prediction models are evaluated is warranted. The area under the receiver operating characteristic curve (AUROC) has been consistently used when evaluating model performance, but it has been shown that AUROC overemphasizes the most common event, resulting in an inaccurate assessment when the data are imbalanced. Furthermore, there is evidence that more advanced MLA, such as an unsupervised k-means model, may show superior performance compared to LR because they can handle a larger number of features.

Purpose: The first objective of the study was to assess the performance of both an unsupervised MLA and an LR model with complete admission neurologic information against the vM and Hicks models. The second objective was to compare the accuracy of the AUROC and the F1-score to determine which method is superior for assessing the diagnostic performance of prediction models on large-scale datasets.

Study design: Retrospective review of a prospective cohort study.

Patient sample: The Rick Hansen Spinal Cord Injury Registry (RHSCIR) was used in this study. All patients enrolled between 2004 and 2017 with complete neurologic examination and Functional Independence Measure outcome data at ≥1 year follow-up or who could walk at discharge were included. The prognostic variables included age (dichotomized at ≥65 years old); American Spinal Injury Association Impairment Scale (AIS) grade; and individual motor, light touch, and pinprick scores from L2 to S1.

Outcome measures: The Functional Independence Measure locomotor score was used to assess independent walking ability at discharge or 1-year follow-up.

Methods: An unsupervised MLA with k=2 was chosen in order to identify a "walk" cluster and a "not walk" cluster. Model performance was assessed through the development of a receiver operating characteristic curve with associated AUROC and a precision-recall curve with associated F1-score. The study and the RHSCIR are supported by funding from Health Canada, Western Economic Diversification Canada, and the Governments of Alberta, British Columbia, Manitoba, and Ontario. These funders had no role in the study or study reporting and the authors have no conflicts of interest to report.
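The two-cluster idea described in the Methods can be illustrated with a minimal sketch. It assumes the unsupervised MLA is k-means (as named in the Background context) and uses plain Lloyd iterations in standard-library Python. The patient tuples, the three-feature layout, the deterministic initialization, and the rule that the cluster with the higher mean motor score is the "walk" cluster are all illustrative assumptions, not the study's data or procedure.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_two_clusters(points, max_iters=100):
    """Lloyd's k-means with k=2 on a list of numeric tuples."""
    # Deterministic initialization (an assumption made for reproducibility):
    # start from the points with the smallest and largest feature sums.
    ordered = sorted(points, key=sum)
    centroids = [ordered[0], ordered[-1]]
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            0 if sq_dist(p, centroids[0]) <= sq_dist(p, centroids[1]) else 1
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Made-up admission scores per patient: (motor, light touch, pinprick) totals.
patients = [
    (2, 1, 0), (0, 0, 0), (4, 3, 2), (1, 2, 1),
    (40, 18, 17), (45, 20, 19), (38, 16, 18), (48, 20, 20),
]

labels, centroids = kmeans_two_clusters(patients)

# Assumption: the cluster whose centroid has the higher motor score is "walk".
walk_cluster = max((0, 1), key=lambda c: centroids[c][0])
predicted_walk = [lab == walk_cluster for lab in labels]
print(predicted_walk)
```

Because clustering never sees the outcome, the cluster labels carry no meaning on their own; some post hoc rule such as the one assumed here is needed before the clusters can be scored against observed walking ability.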

Results: No clinically relevant differences were found between the use of an unsupervised MLA with a greater amount of initial neurologic information and the established standards for any AIS classification. Although demonstrated for all separate AIS classifications, most notably the AUROCs for the vM (0.78) and Hicks (0.76) models were found to be superior to that of the new LR model (0.72); however, the vM and Hicks models had more than double the number of false negative classifications compared to the LR model. The F1-scores of these three models were also found to differ, but with the vM and Hicks models scoring lower than the LR model (0.85, 0.81, and 0.89, respectively).
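To make the two metrics concrete, the following is a minimal sketch of how each can be computed, using the standard definitions F1 = 2TP / (2TP + FP + FN) and AUROC as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case. The scores, labels, and thresholds are invented for illustration and are unrelated to the RHSCIR results reported above.

```python
def confusion_counts(scores, labels, threshold):
    """Return (tp, fp, fn) when predicting positive for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return tp, fp, fn


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, written in terms of counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def auroc(scores, labels):
    """AUROC via pairwise comparison of positive and negative scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # A positive outranking a negative counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy imbalanced sample: 6 walkers (1) and 2 non-walkers (0).
labels = [1, 1, 1, 1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.35, 0.1]

auc = auroc(scores, labels)  # one value, independent of any threshold
f1_strict = f1_score(*confusion_counts(scores, labels, threshold=0.5))
f1_lenient = f1_score(*confusion_counts(scores, labels, threshold=0.25))
print(auc, f1_strict, f1_lenient)
```

In this toy sample the AUROC is fixed by the ranking of the scores, while the F1-score changes with the classification threshold: the stricter threshold misses two walkers (two false negatives) and yields a lower F1 than the lenient one.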

Conclusions: No clinically relevant differences were found between the use of an unsupervised MLA with complete admission neurologic information and the previously validated standards; however, when comparing the performance of the AUROC and F1-score, the AUROC showed inaccurate prognostic performance when there was an imbalance toward a greater number of false negatives. Importantly, the F1-score did not succumb to this imbalance. As AUROC has been used as the standard when evaluating the performance of prediction models, consideration as to whether this is the most appropriate method is warranted. Future work should focus on comparing AUROC and F1-scores with other previously validated models.

Keywords: Area under the curve; F1-score; Logistic regression; Machine learning; Predictive accuracy; Prognosis; Traumatic spinal cord injury.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Adult
  • Aged
  • Female
  • Humans
  • Male
  • Middle Aged
  • Neurologic Examination / methods
  • Prognosis
  • Recovery of Function
  • Spinal Cord Injuries / diagnosis*
  • Spinal Cord Injuries / rehabilitation
  • Unsupervised Machine Learning*
  • Walking*