Automated analysis of the American Academy of Sleep Medicine Inter-Scorer Reliability gold-standard polysomnogram dataset

J Clin Sleep Med. 2025 Nov 1;21(11):1821-1829. doi: 10.5664/jcsm.11848.

Abstract

Study objectives: We compared the performance of a comprehensive automated polysomnogram analysis algorithm, CAISR (Complete Artificial Intelligence Sleep Report), to a multiexpert gold-standard panel, crowdsourced scorers, and experienced technicians for sleep staging and detecting arousals, respiratory events, and limb movements.

Methods: A benchmark dataset of 57 polysomnogram records (Inter-Scorer Reliability dataset) with 200 30-second epochs scored per American Academy of Sleep Medicine guidelines was used. Annotations were obtained from (1) the American Academy of Sleep Medicine multiexpert gold-standard panel, (2) American Academy of Sleep Medicine Inter-Scorer Reliability (ISR) platform users ("crowd," averaging 6,818 raters per epoch), (3) 3 experienced technicians, and (4) CAISR. Agreement was assessed via Cohen's kappa (κ) and percent agreement.
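The two agreement metrics used here can be illustrated with a minimal sketch (not the authors' analysis code): percent agreement is the fraction of epochs on which two scorers assign the same label, and Cohen's kappa corrects that fraction for chance agreement implied by each scorer's marginal label frequencies. The stage labels and values below are illustrative only.

```python
from collections import Counter

def percent_agreement(a, b):
    """Fraction of epochs on which the two scorers assign the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: chance-corrected agreement between two scorers."""
    n = len(a)
    p_o = percent_agreement(a, b)                     # observed agreement
    fa, fb = Counter(a), Counter(b)
    # Expected chance agreement from each scorer's marginal label frequencies.
    p_e = sum((fa[c] / n) * (fb[c] / n) for c in fa.keys() | fb.keys())
    return (p_o - p_e) / (1 - p_e)

# Toy example: five 30-second epochs staged by two scorers.
scorer_1 = ["W", "N1", "N2", "N2", "R"]
scorer_2 = ["W", "N2", "N2", "N2", "R"]
print(percent_agreement(scorer_1, scorer_2))          # 0.8
print(round(cohens_kappa(scorer_1, scorer_2), 2))     # 0.71
```

Note that kappa (0.71) is lower than raw agreement (0.8): scorers who both label most epochs N2 would agree often by chance alone, and kappa discounts that.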

Results: Across tasks, CAISR achieved performance comparable to that of experienced technicians but did not match the consensus-level agreement between the multiexpert gold standard and the crowd. For sleep staging, CAISR's agreement with the multiexpert gold standard was 82.1% (κ = 0.70), comparable to experienced technicians but below the crowd (κ = 0.88). Arousal detection showed 87.81% agreement (κ = 0.45), respiratory event detection 83.18% agreement (κ = 0.34), and limb movement detection 94.89% agreement (κ = 0.11), each comparable to experienced technicians but trailing crowd agreement (κ = 0.83, 0.78, and 0.86 for detection of arousals, respiratory events, and limb movements, respectively).

Conclusions: CAISR achieves experienced technician-level accuracy for polysomnogram scoring tasks but does not surpass the consensus-level agreement of a multiexpert gold standard or the crowd. These findings highlight the potential of automated scoring to match experienced technician-level performance while emphasizing the value of multirater consensus.

Citation: Tripathi A, Nasiri S, Ganglberger W, et al. Automated analysis of the American Academy of Sleep Medicine Inter-Scorer Reliability gold-standard polysomnogram dataset. J Clin Sleep Med. 2025;21(11):1821-1829.

Keywords: arousal detection; artificial intelligence; inter-rater reliability; limb movement; polysomnography; respiratory events; sleep staging.

MeSH terms

  • Algorithms
  • Artificial Intelligence
  • Humans
  • Polysomnography* / methods
  • Polysomnography* / standards
  • Reproducibility of Results
  • Sleep Medicine Specialty*
  • Sleep Stages / physiology
  • Societies, Medical
  • United States