What does this research mean for the field?

The ViSAGe framework generates high-quality first-order ambisonics from silent videos, outperforming several existing video-to-audio (V2A) models on spatial metrics while remaining competitive in semantic quality.

What question did this study set out to answer?

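The study aims to generate first-order ambisonics, a widely used spatial audio format, directly from silent videos, so that immersive audio-visual experiences no longer depend on complex recording systems and specialized expertise.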

March 12, 2026 | Open Access

Towards Scene-Aware Video-to-Spatial Audio Generation

Key Points

The aim is to generate first-order ambisonics from silent videos, improving audio-visual experiences.
Developed evaluation metrics for video-to-audio generation and spatial coherence.
Created YT-Ambigen dataset with 102K video clips and first-order ambisonics for training.
Introduced ViSAGe framework using CLIP features and energy maps for audio generation.
Proposed ViSAGe-SC for improved efficiency with single codebook and codec chaining.
ViSAGe-SC achieved 4x faster training and 5x faster inference compared to previous methods.
Experimental results showed superior performance in spatial metrics over several existing V2A models.
Generated high-quality spatial audio from silent videos while maintaining competitive semantic quality.
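
The key points refer to spatial metrics without spelling them out. As a rough illustration of what a spatial measurement on first-order ambisonics can look like, the sketch below estimates the dominant direction of arrival from the time-averaged active intensity vector, a common textbook estimator. It is not the paper's metric. The function name `estimate_direction` is hypothetical, and the channel order (W, Y, Z, X) with SN3D normalization (the AmbiX convention) is an assumption, since the summary does not state which convention the dataset uses.

```python
import numpy as np


def estimate_direction(foa):
    """Estimate (azimuth, elevation) in radians from a first-order ambisonics clip.

    Assumes `foa` has shape (4, T) with channels ordered W, Y, Z, X
    (ACN order, SN3D normalization). This is an assumption, not a
    convention confirmed by the source.
    """
    w, y, z, x = foa
    # Time-averaged product of the omnidirectional channel with each
    # directional channel approximates the active intensity vector.
    ix = np.mean(w * x)
    iy = np.mean(w * y)
    iz = np.mean(w * z)
    azimuth = np.arctan2(iy, ix)
    elevation = np.arctan2(iz, np.hypot(ix, iy))
    return azimuth, elevation


# Usage: synthesize a single plane-wave source and recover its direction.
rng = np.random.default_rng(0)
signal = rng.standard_normal(4800)
true_az, true_el = np.deg2rad(45.0), np.deg2rad(20.0)
clip = np.stack([
    signal,                                      # W
    signal * np.sin(true_az) * np.cos(true_el),  # Y
    signal * np.sin(true_el),                    # Z
    signal * np.cos(true_az) * np.cos(true_el),  # X
])
est_az, est_el = estimate_direction(clip)
```

Comparing such an estimate between generated and reference audio is one plausible way to score spatial agreement. With several simultaneous sources the single-direction estimate becomes a weighted average, so it should be read as a coarse indicator only.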

Abstract

Spatial audio is essential for enhancing the immersiveness of audio-visual experiences, yet its production typically demands complex recording systems and specialized expertise. In this work, we address a novel problem of generating first-order ambisonics, a widely used spatial audio format, directly from silent videos. To support this task, we develop comprehensive evaluation metrics that capture both standard video-to-audio generation quality and spatial coherence among multiple channels. We introduce YT-Ambigen, a dataset comprising 102K YouTube video clips paired with first-order ambisonics tailored for audio generation, and its expanded version YT-Ambigen+ containing 3x more clips with a rigorously validated high-quality test subset of 19.3K clips. Furthermore, we present Video-to-Spatial Audio Generation (ViSAGe), an end-to-end framework that generates first-order ambisonics from silent videos by leveraging CLIP features, patchwise energy maps, and neural audio codecs with rotation augmentation. To address efficiency challenges, we propose a variant coined ViSAGe-SC (Single Codebook), which replaces complex residual codebooks with an optimized single codebook approach, achieving 4x faster training and 5x faster inference while maintaining superior performance. ViSAGe-SC incorporates heterogeneous codec chaining for postprocessing and candidate reranking for inference-time refinement. Experimental results demonstrate that our approach outperforms several V2A models across spatial metrics and displays competitive performance in semantic quality, generating high-quality spatial audio from video input.
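
The abstract mentions rotation augmentation without giving details. As a minimal sketch of the general idea, the example below rotates a first-order ambisonics clip about the vertical axis (yaw), which mixes only the two horizontal channels and leaves the others unchanged. The function names `encode_plane_wave` and `rotate_foa_yaw` are hypothetical, the channel order (W, Y, Z, X) with SN3D normalization is an assumed convention, and the paper's actual augmentation may differ (for example, it may also rotate the video or use other axes).

```python
import numpy as np


def encode_plane_wave(signal, azimuth_rad, elevation_rad=0.0):
    """Encode a mono signal as a first-order ambisonics plane wave.

    Returns shape (4, T) in the assumed order W, Y, Z, X (SN3D).
    """
    cos_el = np.cos(elevation_rad)
    return np.stack([
        signal,                                  # W: omnidirectional
        signal * np.sin(azimuth_rad) * cos_el,   # Y: left-right
        signal * np.sin(elevation_rad),          # Z: up-down
        signal * np.cos(azimuth_rad) * cos_el,   # X: front-back
    ])


def rotate_foa_yaw(foa, angle_rad):
    """Rotate the sound field counterclockwise about the vertical axis.

    W and Z are invariant under yaw; X and Y transform as a 2D rotation.
    """
    w, y, z, x = foa
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    x_rot = c * x - s * y
    y_rot = s * x + c * y
    return np.stack([w, y_rot, z, x_rot])


# Usage: a source at 30 degrees azimuth, rotated by 60 degrees,
# should match the same source encoded directly at 90 degrees.
tone = np.sin(np.linspace(0.0, 20.0, 1000))
original = encode_plane_wave(tone, np.deg2rad(30.0))
rotated = rotate_foa_yaw(original, np.deg2rad(60.0))
reference = encode_plane_wave(tone, np.deg2rad(90.0))
matches = np.allclose(rotated, reference)
```

Because the rotation is a fixed linear mix of channels, it is cheap to apply on the fly during training. If the visual input is not rotated by the same angle, the audio and video would no longer agree spatially, so any real pipeline has to keep the two in sync.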

Cite This Study

Kim et al. studied this question.

synapsesocial.com/papers/69b257a296eeacc4fcec65e8 https://doi.org/10.1007/s11263-025-02610-4
