Generating room impulse responses (RIRs) with deep neural networks has attracted growing interest because of its potential for realistic acoustic simulation. However, the lack of large-scale RIR datasets hinders the development of RIR generative models and often necessitates workarounds such as data augmentation. In this paper, we explore fine-tuning a text-to-audio generation model to generate plausible RIRs conditioned on acoustic parameters described in natural language prompts. Our experiments test two hypotheses: (1) the audio generative priors of the pretrained model can be effectively leveraged for RIR generation, and (2) the multimodal text-audio latent space of the pretrained model can effectively represent acoustic parameters expressed in natural language. Experimental results show that the RIR-fine-tuned model generates plausible RIRs conditioned on a variety of natural language prompts describing acoustic parameters. This study is the first to demonstrate that generative priors learned by text-to-audio generation models can be effectively leveraged for RIR synthesis, introducing a novel methodological approach to the field.
Kim et al. studied this question.
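To make the idea of conditioning on acoustic parameters expressed in natural language more concrete, the following is a minimal sketch and not the authors' pipeline. It assumes two illustrative parameters (reverberation time RT60 and direct-to-reverberant ratio), a hypothetical prompt template, and a toy decaying-noise signal standing in for a model-generated RIR. The helper names `describe_acoustics`, `synth_rir`, and `estimate_rt60` are invented for this example. The RT60 check uses standard Schroeder backward integration with a line fit over the -5 dB to -25 dB range, and only the Python standard library so the block stays self-contained.

```python
import math
import random


def describe_acoustics(rt60_s, drr_db):
    """Format assumed acoustic parameters as a text prompt (hypothetical template)."""
    return (
        f"A room impulse response with a reverberation time of {rt60_s:.1f} seconds "
        f"and a direct-to-reverberant ratio of {drr_db:.0f} dB."
    )


def synth_rir(rt60_s, sr=16000, seed=0):
    """Toy stand-in for a generated RIR: Gaussian noise with an exponential decay.

    The amplitude envelope 10^(-3 t / RT60) makes the energy fall by 60 dB
    after rt60_s seconds.
    """
    rng = random.Random(seed)
    n = int(1.5 * rt60_s * sr)
    return [rng.gauss(0.0, 1.0) * 10.0 ** (-3.0 * (i / sr) / rt60_s) for i in range(n)]


def estimate_rt60(rir, sr):
    """Estimate RT60 from the Schroeder energy decay curve (fit from -5 to -25 dB)."""
    # Backward integration: energy remaining from each sample to the end.
    edc = []
    total = 0.0
    for x in reversed(rir):
        total += x * x
        edc.append(total)
    edc.reverse()
    edc_db = [10.0 * math.log10(e / edc[0]) for e in edc]

    # Least-squares line through the points between -5 dB and -25 dB.
    pts = [(i / sr, d) for i, d in enumerate(edc_db) if -25.0 <= d <= -5.0]
    mean_t = sum(t for t, _ in pts) / len(pts)
    mean_d = sum(d for _, d in pts) / len(pts)
    slope = sum((t - mean_t) * (d - mean_d) for t, d in pts) / sum(
        (t - mean_t) ** 2 for t, _ in pts
    )
    # The slope is in dB per second; extrapolate to a 60 dB decay.
    return -60.0 / slope


if __name__ == "__main__":
    target_rt60 = 0.5  # seconds, illustrative value
    print(describe_acoustics(target_rt60, 3.0))
    rir = synth_rir(target_rt60)
    print(f"Estimated RT60: {estimate_rt60(rir, 16000):.2f} s")
```

In an actual evaluation, the toy `synth_rir` output would be replaced by the waveform produced by the fine-tuned model for the given prompt, and the estimated parameter could then be compared against the value stated in the prompt.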