Abstract
Spatial audio is essential for enhancing the immersiveness of audio-visual experiences, yet its production typically demands complex recording systems and specialized expertise. In this work, we address the novel problem of generating first-order ambisonics, a widely used spatial audio format, directly from silent videos. To support this task, we develop comprehensive evaluation metrics that capture both standard video-to-audio generation quality and spatial coherence among multiple channels. We introduce YT-Ambigen, a dataset comprising 102K YouTube video clips paired with first-order ambisonics tailored for audio generation, and its expanded version, YT-Ambigen+, which contains 3x more clips and a rigorously validated high-quality test subset of 19.3K clips. Furthermore, we present Video-to-Spatial Audio Generation (ViSAGe), an end-to-end framework that generates first-order ambisonics from silent videos by leveraging CLIP features, patchwise energy maps, and neural audio codecs with rotation augmentation. To address efficiency challenges, we propose a variant termed ViSAGe-SC (Single Codebook), which replaces complex residual codebooks with an optimized single-codebook approach, achieving 4x faster training and 5x faster inference while maintaining superior performance. ViSAGe-SC incorporates heterogeneous codec chaining for postprocessing and candidate reranking for inference-time refinement. Experimental results demonstrate that our approach outperforms several video-to-audio (V2A) models across spatial metrics and displays competitive performance in semantic quality, generating high-quality spatial audio from video input.
Kim et al. (Mon,) studied this question.
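To make the first-order ambisonics (FOA) format and the idea of rotation augmentation mentioned in the abstract more concrete, the following is a minimal sketch of rotating an FOA sound field about the vertical axis. It is an illustration under stated assumptions, not the authors' implementation: the function names `encode_plane_wave` and `rotate_foa_yaw` are hypothetical, the channels are assumed to be the usual W (omnidirectional), X (front), Y (left), and Z (up) components with SN3D-style first-order gains, and only yaw rotation is shown. How ViSAGe actually applies rotation augmentation is not specified here.

```python
import math
from typing import List, Tuple

Channel = List[float]
FOA = Tuple[Channel, Channel, Channel, Channel]  # (W, X, Y, Z), assumed layout


def encode_plane_wave(signal: Channel, azimuth_rad: float) -> FOA:
    """Encode a mono signal as a horizontal source at the given azimuth.

    Assumes SN3D-style first-order gains: W = s, X = s*cos(az),
    Y = s*sin(az), Z = 0 for a source at zero elevation.
    """
    w = list(signal)
    x = [math.cos(azimuth_rad) * v for v in signal]
    y = [math.sin(azimuth_rad) * v for v in signal]
    z = [0.0 for _ in signal]
    return w, x, y, z


def rotate_foa_yaw(foa: FOA, angle_rad: float) -> FOA:
    """Rotate the sound field counterclockwise about the vertical axis.

    W and Z are invariant under yaw; X and Y mix through a 2-D rotation.
    """
    w, x, y, z = foa
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    x_rot = [c * xi - s * yi for xi, yi in zip(x, y)]
    y_rot = [s * xi + c * yi for xi, yi in zip(x, y)]
    return list(w), x_rot, y_rot, list(z)


if __name__ == "__main__":
    mono = [1.0, 0.5, -0.25]
    front = encode_plane_wave(mono, azimuth_rad=0.0)
    # Rotating by 90 degrees should move the front source to the left.
    left = rotate_foa_yaw(front, math.pi / 2)
    print(left)
```

Because a yaw rotation only mixes X and Y, the same rotation could in principle be paired with a matching horizontal shift of a 360-degree video frame to keep audio and visuals aligned; that pairing is an assumption about how such augmentation is typically used, not a claim about this work.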