Study objectives: To determine the reasons for inter-scorer variability in sleep staging of polysomnograms (PSGs).
Methods: Fifty-six PSGs were scored (5-stage sleep scoring) by 2 experienced technologists (first manual, M1). Months later, the technologists edited their own scoring (second manual, M2) based on feedback from the investigators that highlighted differences between their scores. The PSGs were then scored with an automatic system (Auto), and the technologists edited them epoch-by-epoch (Edited-Auto). This resulted in 6 different manual scores for each PSG. Epochs were classified as scorer errors (one M1 score differed from the other 5 scores), scorer bias (all 3 scores agreed within each technologist but differed between the 2 technologists), and equivocal (sleep scoring was inconsistent both within and between technologists).
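The epoch classification rule above can be expressed as a decision procedure over the 6 scores per epoch. The following is a minimal sketch; the function name, label strings, and the tuple ordering (M1, M2, Edited-Auto) are illustrative assumptions, not the investigators' actual implementation.

```python
# Hedged sketch of the epoch classification described in Methods.
# Stage labels ("W", "N1", "N2", "N3", "R") and the 3-tuple ordering
# (M1, M2, Edited-Auto) per technologist are assumptions for illustration.

def classify_epoch(tech_a, tech_b):
    """Classify one epoch from two technologists' score triplets.

    tech_a, tech_b: (M1, M2, Edited-Auto) stage labels.
    Returns "agreement", "scorer_error", "scorer_bias", or "equivocal".
    """
    scores = list(tech_a) + list(tech_b)   # [a_M1, a_M2, a_EA, b_M1, b_M2, b_EA]
    if len(set(scores)) == 1:
        return "agreement"                 # all 6 scores identical

    # Scorer error: one M1 score differs from the other 5, which all agree.
    for i in (0, 3):                       # indices of the two M1 scores
        others = scores[:i] + scores[i + 1:]
        if len(set(others)) == 1 and scores[i] != others[0]:
            return "scorer_error"

    # Scorer bias: each technologist is internally consistent,
    # but the two technologists disagree with each other.
    if len(set(tech_a)) == 1 and len(set(tech_b)) == 1:
        return "scorer_bias"

    # Equivocal: scoring inconsistent within and/or between technologists.
    return "equivocal"


# Illustrative usage:
print(classify_epoch(("N2", "N2", "N2"), ("N2", "N2", "N2")))  # agreement
print(classify_epoch(("N1", "N2", "N2"), ("N2", "N2", "N2")))  # scorer_error
print(classify_epoch(("N2", "N2", "N2"), ("N3", "N3", "N3")))  # scorer_bias
print(classify_epoch(("N1", "N2", "N1"), ("N2", "N3", "N2")))  # equivocal
```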
Results: Percent agreement in M1 was 78.9% ± 9.0% and was unchanged in M2 (78.1% ± 9.7%) despite numerous edits (≈40/PSG) by the scorers. Agreement in Edited-Auto was higher (86.5% ± 6.4%, p < 10⁻⁹). Scorer errors (< 2% of epochs) and scorer bias (3.5% ± 2.3% of epochs) together accounted for < 20% of M1 disagreements. A large number of epochs (92 ± 44/PSG) with scoring agreement in M1 were subsequently changed in M2 and/or Edited-Auto. Equivocal epochs, which showed scoring inconsistency, accounted for 28% ± 12% of all epochs, and for up to 76% of all epochs in individual patients. Disagreements were largely between wake/NREM, N1/N2, and N2/N3 sleep.
Conclusion: Inter-scorer variability is largely due to epochs that are difficult to classify. Availability of digitally identified events (e.g., spindles) or calculated variables (e.g., depth of sleep, delta wave duration) during scoring may greatly reduce scoring variability.
Keywords: PSG; automated scoring; inter-observer variability; sleep stages.
© 2016 American Academy of Sleep Medicine.