Text-to-audio generation models offer limited controllability because a text prompt alone cannot capture all the information in an audio signal. To address this issue, we propose a novel model that enhances the controllability of existing pre-trained text-to-audio models by supplementing the text with additional conditions: content (timestamps) and style (pitch contour and energy contour). This approach achieves fine-grained control over the temporal order, pitch, and energy of the generated audio. To preserve the diversity of generation, we keep the weights of the pre-trained text-to-audio model frozen and train two added components that encode and fuse the additional conditions: a control condition encoder enhanced by a large language model, and a Fusion-Net. Because suitable datasets and evaluation metrics are lacking, we consolidate existing datasets into a new dataset of audio paired with its corresponding conditions, and we use a series of evaluation metrics to assess controllability. Experimental results demonstrate that our model achieves fine-grained control for controllable audio generation.
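To make the additional conditions concrete, the sketch below shows one plausible way to turn a waveform and a list of timed events into frame-level control signals: an energy contour and a per-event timestamp mask. It is an illustration only, not the authors' pipeline. The function names `energy_contour` and `timestamp_mask`, the RMS definition of energy, the frame sizes, and the `(label, onset, offset)` event format are all assumptions, and pitch-contour extraction is omitted.

```python
import math


def energy_contour(samples, frame_len=4, hop=4):
    """Root-mean-square energy per frame of a mono waveform (assumed definition)."""
    contour = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        contour.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return contour


def timestamp_mask(events, num_frames, frame_dur):
    """Binary frame-level activity for each event label.

    events: list of (label, onset_seconds, offset_seconds) tuples
    (a hypothetical format chosen for this sketch).
    """
    mask = {}
    for label, onset, offset in events:
        row = mask.setdefault(label, [0] * num_frames)
        first = int(onset / frame_dur)
        last = min(num_frames, math.ceil(offset / frame_dur))
        for i in range(first, last):
            row[i] = 1
    return mask


# Toy input: four silent samples followed by four full-scale samples.
samples = [0.0, 0.0, 0.0, 0.0, 1.0, -1.0, 1.0, -1.0]
contour = energy_contour(samples)

# One event active during the second half-second frame.
mask = timestamp_mask([("dog bark", 0.5, 1.0)], num_frames=2, frame_dur=0.5)
```

In a setup like the one described, frame-aligned signals of this kind would be passed to the trainable condition encoder while the pre-trained text-to-audio model stays frozen; how the paper actually extracts and encodes them is not stated in this abstract.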
Zhifang Guo
Jianguo Mao
Rui Tao
Chinese Academy of Sciences
University of Chinese Academy of Sciences
Institute of Computing Technology
Guo et al. studied this question.
www.synapsesocial.com/papers/68e72962b6db6435876a3313 — DOI: https://doi.org/10.1609/aaai.v38i16.29773