Background: Polysomnography (PSG) is the gold standard for diagnosing sleep disorders. However, manual scoring is subjective, and the resulting inter-scorer variability can undermine diagnostic accuracy and subsequent clinical decisions. This study quantifies scoring concordance among multiple scorers across clinical subgroups to identify the factors that contribute to interpretive discrepancies.

Methods: We conducted a retrospective analysis of overnight diagnostic PSG data from adult patients at a tertiary university hospital sleep center. Three expert scorers independently and blindly scored recordings from 30 subjects selected through stratified random sampling. Scoring followed the American Academy of Sleep Medicine criteria and covered sleep stages, arousals, respiratory events, and leg movements, all in 30 s epochs. Interrater agreement was measured using Fleiss’ κ with 95% confidence intervals, and subgroup analyses were performed by diagnostic category.

Results: The analysis included 28,291 epochs from 30 adults in the normal, insomnia, mild-to-severe obstructive sleep apnea (OSA), and periodic limb movement (PLM) disorder subgroups. Overall interrater agreement for sleep staging among the three scorers was nearly perfect (Fleiss’ κ = 0.932), with the highest concordance in stages W, N2, and R and excellent agreement in stages N1 and N3. Among respiratory events, agreement was near-perfect for apnea (κ = 0.955) and substantial for hypopnea; agreement was also substantial for arousals and PLMs. In pairwise analyses, concordance was highest between scorer 1 and scorer 3 and lower between scorer 1 and scorer 2, particularly for arousals and limb movements. In subgroup analyses, agreement was highest and most stable in moderate OSA, whereas severe OSA showed reduced reliability for sleep staging and arousal scoring, indicating that scoring becomes more complex as sleep fragmentation increases.

Conclusions: Although expert PSG scoring demonstrates high overall reliability, significant variability persists in complex cases such as severe OSA. These findings underscore the need for structured quality assurance and automated tools to improve diagnostic consistency in clinical practice.
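As an illustration of the agreement statistic named in the Methods, the sketch below computes Fleiss’ κ for three scorers who each assign one label per 30 s epoch, and adds a percentile bootstrap 95% confidence interval. This is not the authors’ code. The abstract does not say how the confidence intervals were obtained, so the bootstrap is an assumption, and resampling individual epochs ignores the correlation between epochs from the same subject. The function names and the toy labels are hypothetical.

```python
import math
import random
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of epochs, each a tuple of one label per scorer."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("every epoch needs the same number (>= 2) of scorers")

    category_totals = Counter()
    agreement_sum = 0.0
    for epoch in ratings:
        counts = Counter(epoch)
        category_totals.update(counts)
        # Fraction of scorer pairs that agree on this epoch.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += agreeing / (n_raters * (n_raters - 1))

    observed = agreement_sum / n_items
    total = n_items * n_raters
    # Chance agreement from the pooled category proportions.
    expected = sum((c / total) ** 2 for c in category_totals.values())
    if expected == 1.0:
        return float("nan")  # only one category in use: kappa is undefined
    return (observed - expected) / (1.0 - expected)


def bootstrap_ci(ratings, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI over epochs (assumed method, see lead-in)."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        sample = [rng.choice(ratings) for _ in ratings]
        k = fleiss_kappa(sample)
        if not math.isnan(k):  # skip degenerate single-category resamples
            stats.append(k)
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi


# Hypothetical toy data: six epochs, three scorers each.
toy_epochs = [
    ("W", "W", "W"),
    ("N1", "N2", "N1"),
    ("N2", "N2", "N2"),
    ("N3", "N3", "N2"),
    ("R", "R", "R"),
    ("W", "W", "W"),
]

kappa = fleiss_kappa(toy_epochs)  # 29/41, about 0.707, for this toy data
ci_low, ci_high = bootstrap_ci(toy_epochs)
print(f"kappa = {kappa:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

For a subgroup analysis like the one reported, the same function could be applied to the epochs of each diagnostic category separately. Resampling whole subjects instead of epochs would be a more conservative choice for the interval.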
Choi et al. studied this question.