What question did this study set out to answer?

The study aims to assess the level of agreement among multiple scorers of polysomnography data across different clinical conditions.

April 16, 2026 | Open Access

Analysis of Interrater Reliability and Interpretive Discrepancies in Polysomnography Scoring Across Clinical Subgroups

Key Points

The study aims to assess the level of agreement among multiple scorers of polysomnography data across different clinical conditions.
The authors conducted a retrospective analysis of overnight diagnostic polysomnography (PSG) data from adult patients at a tertiary university hospital sleep center.
Three independent expert scorers evaluated interrater reliability on 30 subjects selected through stratified random sampling.
PSG data were scored according to American Academy of Sleep Medicine criteria, covering sleep stages, arousals, respiratory events, and leg movements.
Interrater agreement was measured using Fleiss’ κ, with subgroup analyses by diagnostic category.
Overall interrater agreement for sleep staging was nearly perfect (Fleiss’ κ = 0.932).
Concordance was highest in stages W, N2, and R, and agreement was excellent in stages N1 and N3.
Agreement was near-perfect for apnea (κ = 0.955) and substantial for hypopnea, arousals, and periodic limb movements (PLMs).
Pairwise agreement varied between scorers, most notably for arousal and limb movement detection, and reliability for sleep staging and arousal scoring was reduced in severe obstructive sleep apnea (OSA).

Abstract

Background: Polysomnography (PSG) is the gold standard for diagnosing sleep disorders. However, the subjectivity of manual scoring can lead to inter-scorer variability, undermining diagnostic accuracy and subsequent clinical decisions. This study aims to quantitatively assess scoring concordance among multiple scorers across various clinical subgroups to identify the factors that contribute to interpretive discrepancies.

Methods: We conducted a retrospective analysis of overnight diagnostic PSG data from adult patients at a tertiary university hospital sleep center. Interrater reliability was evaluated by three independent expert scorers for 30 subjects selected through stratified random sampling. The polysomnographic data were independently and blindly scored according to the American Academy of Sleep Medicine criteria, focusing on sleep stages, arousals, respiratory events, and leg movements, all scored in 30 s epochs. Interrater agreement was measured using Fleiss’ κ, along with 95% confidence intervals, and included subgroup analyses by diagnostic category.

Results: The analysis included a total of 28,291 epochs from 30 adults across normal, insomnia, obstructive sleep apnea (OSA) mild–severe, and periodic limb movement (PLM) disorder subgroups. The overall interrater agreement for sleep staging among the three scorers was nearly perfect (Fleiss’ κ = 0.932), with the highest concordance observed in stages W, N2, and R, and excellent agreement in stages N1 and N3. Respiratory events showed particularly high reliability, with near-perfect agreement for apnea (κ = 0.955) and substantial agreement for hypopnea, arousals, and PLMs. Pairwise analyses indicated the highest concordance between scorer 1 and scorer 3, while the agreement between scorer 1 and scorer 2 was lower, particularly for detecting arousals and limb movements. Subgroup analyses showed the highest and most stable agreement in moderate OSA, whereas severe OSA exhibited reduced reliability for sleep staging and arousal scoring, indicating increased scoring complexity with greater sleep fragmentation.

Conclusions: Although expert PSG scoring demonstrates high overall reliability, significant variability persists in complex cases like severe OSA. These findings underscore the necessity for structured quality assurance and automated tools to improve diagnostic consistency in clinical practice.
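
For readers unfamiliar with the statistic, the following is a minimal sketch of how Fleiss’ κ can be computed from epoch-by-epoch stage labels assigned by several scorers. It is not the authors' analysis code. The function names, the `STAGES` list, and the five-epoch toy data are illustrative assumptions, and the interpretation bands follow the commonly used Landis and Koch convention, which the abstract's wording ("nearly perfect", "substantial") appears to match but does not cite explicitly.

```python
from collections import Counter

# Hypothetical label set: the AASM stage names mentioned in the abstract.
STAGES = ["W", "N1", "N2", "N3", "R"]


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a list of per-epoch label tuples (one label per scorer)."""
    n_items = len(ratings)
    n_raters = len(ratings[0])

    # counts[i][j] = number of scorers who assigned category j to epoch i
    counts = []
    for row in ratings:
        tally = Counter(row)
        counts.append([tally[c] for c in categories])

    # Observed agreement: mean over epochs of the share of agreeing scorer pairs
    p_i = [
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items

    # Chance agreement: sum of squared overall category proportions
    total = n_items * n_raters
    p_j = [sum(row[j] for row in counts) / total for j in range(len(categories))]
    p_e = sum(p * p for p in p_j)

    if p_e == 1.0:
        return float("nan")  # kappa is undefined when only one category is ever used
    return (p_bar - p_e) / (1.0 - p_e)


def landis_koch(kappa):
    """Map kappa to the Landis and Koch descriptive bands (an assumed convention)."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Toy data (invented for illustration): five epochs, three scorers each.
toy = [
    ("W", "W", "W"),
    ("N1", "N2", "N2"),
    ("N2", "N2", "N2"),
    ("N3", "N3", "N2"),
    ("R", "R", "R"),
]

kappa = fleiss_kappa(toy, STAGES)
print(round(kappa, 3), landis_koch(kappa))  # 0.639 substantial
```

Under this convention, the reported overall value of 0.932 falls in the top band. The abstract does not state how the 95% confidence intervals were obtained; one common option, offered here only as an assumption, is to resample subjects with replacement and recompute κ on each resample.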
