Study objectives: Automated sleep staging has been previously limited by a combination of clinical and physiological heterogeneity. Both factors are in principle addressable with large data sets that enable robust calibration. However, the impact of sample size remains uncertain. The objectives are to investigate the extent to which machine learning methods can approximate the performance of human scorers when supplied with sufficient training cases and to investigate how staging performance depends on the number of training patients, contextual information, model complexity, and imbalance between sleep stage proportions.
Methods: A total of 102 features were extracted from six electroencephalography (EEG) channels in routine polysomnography. Two thousand nights were partitioned into equal (n = 1000) training and testing sets for validation. We used epoch-by-epoch Cohen's kappa statistics to measure the agreement between classifier output and human scorer according to American Academy of Sleep Medicine scoring criteria.
Results: Epoch-by-epoch Cohen's kappa improved with increasing training EEG recordings until saturation occurred (n = ~300). The kappa value was further improved by accounting for contextual (temporal) information, increasing model complexity, and adjusting the model training procedure to account for the imbalance of stage proportions. The final kappa on the testing set was 0.68. Testing on more EEG recordings leads to kappa estimates with lower variance.
Conclusion: Training with a large data set enables automated sleep staging that compares favorably with human scorers. Because testing was performed on a large and heterogeneous data set, the performance estimate has low variance and is likely to generalize broadly.
Keywords: EEG; big data; machine learning; sleep stages.
© Sleep Research Society 2017. Published by Oxford University Press on behalf of the Sleep Research Society. All rights reserved. For permissions, please e-mail email@example.com.