Abstract

Introduction
Recent sound-based AI models have shown strong performance in sleep staging, obstructive sleep apnea (OSA) event detection, and snore detection from nocturnal breathing sounds. Prior studies have also demonstrated that breathing sounds contain physiological signatures related to oxygen desaturation, arousals, and sleep position. However, current models typically use a shared feature extractor followed by separate task-specific classifiers, so parameter count and training time grow rapidly as the number of tasks increases. To address this, we converted one of the existing classifiers into a multi-head classifier that produces outputs for multiple tasks via lightweight multilayer perceptron (MLP) heads, incurring minimal additional computational cost. This study evaluates the unified model across six sleep-related tasks.

Methods
The model integrates four tasks (OSA event, desaturation, arousal, and sleep position) into one shared classifier with lightweight MLP heads; sleep staging and snore detection retain their existing dedicated classifiers. The architecture processes 80 consecutive Mel-spectrogram frames representing each 30-second epoch. Training was performed using 2,973 nights of polysomnography (PSG) data with synchronized audio, and evaluation was conducted on an independent dataset of 802 nights (age 50.5 ± 15.4 years; BMI 25.5 ± 3.6 kg/m²; AHI 21.2 ± 20.6 events/h; male:female = 530:272). All tasks except sleep staging were trained at the sub-epoch level but evaluated at the epoch level for comparison with previous single-task models.

Results
The unified model achieved performance comparable to single-task baselines. The three existing tasks maintained robust performance (sleep staging: macro F1 0.77, accuracy 80.6%; OSA: macro F1 0.78, accuracy 89.5%; snore detection: macro F1 0.89, accuracy 90.4%).
The newly integrated tasks showed slight improvements relative to previously reported models: desaturation (F1 0.80, sensitivity 0.88, specificity 0.92 for desaturation-containing epochs), arousal (F1 0.67, sensitivity 0.74, specificity 0.91 for arousal-containing epochs), and sleep position (F1 0.88, sensitivity 0.89, specificity 0.74 for supine epochs; overall accuracy 84.0%, macro F1 0.82).

Conclusion
The ability to jointly infer diverse sleep events from a single multi-head classifier suggests that these tasks rely on overlapping physiological representations. This scalable architecture enables efficient multitask learning without loss of accuracy and provides a strong basis for developing a sound-based foundation model capable of comprehensive sleep analysis.
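
As an illustration of the multi-head design described in Methods, the following is a minimal sketch of several lightweight MLP heads reading one shared feature vector. It is not the authors' implementation: the feature and hidden dimensions, the two-class output per task, the random weights, and all function names are assumptions made only for this example, and the shared classifier itself is replaced by a placeholder feature vector.

```python
import random

# Hypothetical sizes chosen for illustration only (not given in the abstract).
FEATURE_DIM = 16  # length of the shared classifier's output vector
HIDDEN_DIM = 8    # hidden width of each lightweight MLP head

# The four integrated tasks; two output classes per task is an assumption.
TASK_CLASSES = {
    "osa_event": 2,
    "desaturation": 2,
    "arousal": 2,
    "sleep_position": 2,
}


def init_linear(rng, n_in, n_out):
    """Create a dense layer as (weights, bias) with small random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias


def linear(x, weights, bias):
    """Compute weights @ x + bias for a single input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in x]


def init_heads(rng):
    """Build one two-layer MLP head per task."""
    return {
        task: (
            init_linear(rng, FEATURE_DIM, HIDDEN_DIM),
            init_linear(rng, HIDDEN_DIM, n_classes),
        )
        for task, n_classes in TASK_CLASSES.items()
    }


def multi_head_forward(shared_features, heads):
    """Apply every task head to the same shared feature vector."""
    outputs = {}
    for task, ((w1, b1), (w2, b2)) in heads.items():
        hidden = relu(linear(shared_features, w1, b1))
        outputs[task] = linear(hidden, w2, b2)  # unnormalized class scores
    return outputs


def count_parameters(heads):
    """Total number of weights and biases across all heads."""
    total = 0
    for layers in heads.values():
        for weights, bias in layers:
            total += sum(len(row) for row in weights) + len(bias)
    return total


rng = random.Random(0)
heads = init_heads(rng)
# Placeholder standing in for the shared classifier's output for one segment.
features = [rng.uniform(-1.0, 1.0) for _ in range(FEATURE_DIM)]
logits = multi_head_forward(features, heads)
head_parameters = count_parameters(heads)
```

Under these assumed sizes, each added task costs only one small head on top of the shared computation, which is the property the abstract attributes to the architecture.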
Kim et al. (Fri,) studied this question.