Tampere University
Tampere University
University of Oxford
Tampere University
https://arxiv.org/abs/2409.13689
https://github.com/ilpoviertola/V-AURA
We introduce V-AURA, the first autoregressive model to achieve high temporal alignment and relevance in video-to-audio generation. V-AURA uses a high-framerate visual feature extractor and a cross-modal audio-visual feature fusion strategy to capture fine-grained visual motion events and ensure precise temporal alignment. Additionally, we propose VisualSound, a benchmark dataset with high audio-visual relevance. VisualSound is based on VGGSound, a video dataset of in-the-wild samples extracted from YouTube. During curation, we remove samples in which the auditory events are not aligned with the visual ones. V-AURA outperforms current state-of-the-art models in temporal alignment and semantic relevance while maintaining comparable audio quality.
Figure 1: Overview of V-AURA.
Given a visual stream, our autoregressive model generates semantically and temporally matching audio by predicting audio tokens encoded with a high-fidelity neural audio codec. Our model utilizes cross-modal feature fusion, emphasizing the natural co-occurrence of audio and visual events better than conventional conditioning methods. We use a 6 times higher video framerate than the state-of-the-art [4, 5], capturing even the most rapid events in the video stream.
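To make the fusion idea concrete, below is a minimal PyTorch sketch of one plausible cross-modal fusion strategy: visual features are projected into the model dimension, upsampled to the audio token rate, and added to the audio token embeddings before an autoregressive backbone. Module names, dimensions, the codec vocabulary size, and the additive fusion are illustrative assumptions, not the exact V-AURA architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalFusion(nn.Module):
    """Toy sketch: fuse high-framerate visual features with audio token
    embeddings before an autoregressive transformer. Shapes, sizes, and the
    additive fusion are illustrative, not the V-AURA implementation."""

    def __init__(self, visual_dim=768, model_dim=1024, vocab_size=1024, num_layers=4):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, model_dim)   # map visual features to model space
        self.audio_embed = nn.Embedding(vocab_size, model_dim)  # codec token vocabulary (assumed size)
        layer = nn.TransformerEncoderLayer(model_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(model_dim, vocab_size)          # next audio-token logits

    def forward(self, visual_feats, audio_tokens):
        # visual_feats: (B, T_v, visual_dim) sampled at a high framerate
        # audio_tokens: (B, T_a) integer codec tokens
        v = self.visual_proj(visual_feats)                    # (B, T_v, D)
        a = self.audio_embed(audio_tokens)                    # (B, T_a, D)
        # Upsample the visual stream to the audio token rate so each audio
        # step sees temporally matching visual context, then fuse by addition.
        v = F.interpolate(v.transpose(1, 2), size=a.shape[1],
                          mode="linear", align_corners=False).transpose(1, 2)
        fused = a + v                                         # one simple cross-modal fusion choice
        mask = nn.Transformer.generate_square_subsequent_mask(a.shape[1]).to(fused.device)
        h = self.backbone(fused, mask=mask)                   # causal mask keeps generation autoregressive
        return self.head(h)                                   # (B, T_a, vocab_size)
```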
Evaluation. We use Fréchet Audio Distance (FAD) to evaluate the audio quality, Kullback-Leibler Divergence (KLD) to evaluate the relevance of the generated audio to the ground-truth audio, the ImageBind [2] score (IB) to evaluate the relevance of the generated audio with respect to the conditioning video stream, and a synchronization score (Sync) to evaluate the temporal alignment. The synchronization score is the mean absolute offset, in milliseconds, between the generated audio and the corresponding video stream, as judged by Synchformer [3].
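As a rough illustration of how the relevance and alignment metrics aggregate per-sample measurements, the sketch below computes the IB score as the mean cosine similarity between ImageBind audio and video embeddings, and the Sync score as the mean absolute predicted offset converted to milliseconds. The embedding and offset extraction steps (ImageBind and Synchformer inference) are assumed to happen elsewhere and are not shown.

```python
import torch
import torch.nn.functional as F

def imagebind_score(audio_embs: torch.Tensor, video_embs: torch.Tensor) -> float:
    """IB score sketch: mean cosine similarity between (N, D) generated-audio
    and video embeddings, assumed to come from ImageBind."""
    return F.cosine_similarity(audio_embs, video_embs, dim=-1).mean().item()

def sync_score(offsets_sec: torch.Tensor) -> float:
    """Sync score sketch: mean absolute audio-visual offset in milliseconds,
    where each per-sample offset (in seconds) is assumed to be predicted by Synchformer."""
    return offsets_sec.abs().mean().item() * 1000.0
```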
We propose VisualSound, a subset of the VGGSound [1] dataset curated for video samples with high audio-visual correspondence. The curation is based on the ImageBind model [2], a state-of-the-art joint embedding model that embeds audio and video into the same feature space, which enables us to compute the cosine similarity between the modalities. In the table below, we report evaluation scores on VGGSound-Sparse [6] for models trained on different VisualSound variants. We emphasize temporal generation quality and thus select the variant with the 0.3 threshold (best Sync); a sketch of the curation step follows Table 1.
Thresh. | # Samples | Train time ⏱️ ⬇️ | KLD ⬇️ | FAD ⬇️ | IB ⬆️ | Sync (ms) ⬇️
---|---|---|---|---|---|---
0.0 | 155 591 | 708 | 1.94 | 3.16 | 28.51 | 60 |
0.2 | 119 469 | 662 | 1.94 | 3.49 | 28.66 | 59 |
0.3 | 77 265 | 278 | 1.93 | 3.55 | 28.92 | 49 |
0.4 | 33 225 | 168 | 1.97 | 4.11 | 27.31 | 71 |
Table 1: Different VisualSound variants. Samples whose audio and video embeddings have a cosine similarity below the reported threshold are filtered out of the training set.
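For reference, the curation step behind Table 1 can be sketched as a simple threshold filter over per-sample cosine similarities. Here `embed_audio` and `embed_video` are hypothetical wrappers around ImageBind rather than its actual API, and the default threshold of 0.3 corresponds to the variant we select.

```python
import torch.nn.functional as F

def filter_by_av_similarity(samples, embed_audio, embed_video, threshold=0.3):
    """Curation sketch for VisualSound: keep only clips whose audio and video
    embeddings (assumed to be ImageBind embeddings of shape (D,)) have a
    cosine similarity of at least `threshold`."""
    kept = []
    for sample in samples:
        a = embed_audio(sample["audio_path"])   # (D,) audio embedding
        v = embed_video(sample["video_path"])   # (D,) video embedding
        sim = F.cosine_similarity(a, v, dim=0).item()
        if sim >= threshold:
            kept.append(sample)
    return kept
```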
Figure 2: V-AURA generates temporally matching audio.
Datasets. We train our model on the VisualSound dataset. For evaluation, we use VGGSound-Sparse [6], and the test sets of VGGSound, VisualSound, and Visually Aligned Sounds (VAS) [7]. In particular, the VGGSound test set enables us to evaluate the overall generation quality over a wide range of audio classes. VGGSound-Sparse and VisualSound give a better view of how well the model aligns sounds with actions over time, as these datasets are curated for strong audio-visual relevance. To further evaluate the capabilities of V-AURA, we run experiments on VAS.
Visually guided audio generation. Our approach achieves state-of-the-art temporal generation quality (Sync) and relevance (KLD, IB), while maintaining comparable audio quality (FAD). The figure above displays the temporal generation capabilities of V-AURA. We also show that the proposed VisualSound dataset is a valuable resource for training and evaluating audio-visual models.