What question did this study set out to answer?

The aim is to improve sound event detection and separation using spatial semantic segmentation techniques.

May 14, 2026

Spatial semantic segmentation of sound scenes

Key Points

S5 aims to improve sound event detection and separation from multi-channel mixtures that carry spatial information.
The S5 task was adopted as DCASE2025 Task4.
The work uses the newly recorded DCASE2025 Task4 Dataset.
Sound mixtures are decomposed into individual sound objects paired with class labels and 6DoF spatial metadata.
With that metadata, a rendering engine can update each object's direction and distance in real time as the listener moves.
One possible application is extended reality (XR) services that capture a user's surrounding acoustic scene and transmit it to remote participants.
S5 can also support assisted living through room sound monitoring.

Abstract

Spatial Semantic Segmentation of Sound Scenes (S5) aims to enhance technologies for sound event detection and separation from multi-channel input signals that mix multiple sound events with spatial information. One possible application of S5 is extended reality (XR) services that capture a user’s surrounding acoustic scene and transmit it to remote participants. By decomposing the mixture into individual sound objects paired with class labels and six-degrees-of-freedom (6DoF) metadata, S5 lets the rendering engine update each object's direction and distance in real time as the listener moves. S5 can also support assisted living through room sound monitoring. The S5 task was adopted as DCASE2025 Task4; we outline its setting within the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge and the newly recorded DCASE2025 Task4 Dataset. In this presentation we relate S5 to past DCASE tasks, describe the new dataset, and discuss current challenges and future directions for S5.
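
The rendering step mentioned in the abstract can be illustrated with a minimal sketch. It is not the S5 system or any DCASE baseline: the `SoundObject` type, its fields, and the function `relative_direction_and_distance` are hypothetical names introduced here, and the listener's orientation is simplified to a single yaw angle rather than a full 6DoF pose. The sketch only shows how a separated object's stored position could be turned into a direction and distance relative to a moving listener.

```python
import math
from dataclasses import dataclass


@dataclass
class SoundObject:
    """Hypothetical container for one separated sound object (an assumption)."""
    label: str                              # sound event class label
    position: tuple[float, float, float]    # source position in room coordinates (m)


def relative_direction_and_distance(obj, listener_pos, listener_yaw_rad):
    """Return (azimuth, elevation, distance) of obj as seen by the listener.

    Simplification: listener orientation is yaw only (rotation about the
    vertical axis); pitch and roll are ignored in this sketch.
    """
    dx = obj.position[0] - listener_pos[0]
    dy = obj.position[1] - listener_pos[1]
    dz = obj.position[2] - listener_pos[2]

    distance = math.sqrt(dx * dx + dy * dy + dz * dz)

    # Azimuth relative to the listener's facing direction, wrapped to [-pi, pi].
    azimuth = math.atan2(dy, dx) - listener_yaw_rad
    azimuth = math.atan2(math.sin(azimuth), math.cos(azimuth))

    # Elevation above the horizontal plane through the listener.
    elevation = math.atan2(dz, math.hypot(dx, dy))
    return azimuth, elevation, distance


# Example values are illustrative only.
doorbell = SoundObject(label="doorbell", position=(1.0, 0.0, 0.0))

# Listener at the origin, facing along +x: the source is straight ahead.
ahead = relative_direction_and_distance(doorbell, (0.0, 0.0, 0.0), 0.0)

# Listener moves to (1, -1, 0), still facing +x: the source is now to the left.
moved = relative_direction_and_distance(doorbell, (1.0, -1.0, 0.0), 0.0)
```

Calling the function again whenever the listener pose changes is what would keep direction and distance current; a real renderer would also handle full head rotation and moving sources.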
