In addition to detection, classification, and localization of sound events, the ability to isolate one or more sound sources in an acoustic scene is important for downstream applications such as assisted listening and virtual/augmented reality. Because of the wide variety of sound sources we may want to isolate, the ability to flexibly specify the source(s) of interest (e.g., via natural language queries) is necessary for most practical applications. In this talk, I will first discuss the advantages and disadvantages of natural language queries for audio source extraction. I will then discuss Task-aware Unified Source Separation (TUSS), a recent prompt-based separation model that can decompose a sound scene into a flexible number of output sources, including multiple sources of the same type.
Wichern et al. (Wed,) studied this question.