Audio-Visual Synchronization
9 papers with code • 0 benchmarks • 3 datasets
Most implemented papers
Sparse in Space and Time: Audio-visual Synchronisation with Trainable Selectors
Unlike synchronising videos of talking heads, where audio-visual correspondence is dense in both time and space, this work tackles videos whose synchronisation cues are sparse in space and time.
Multimodal Transformer Distillation for Audio-Visual Synchronization
This paper proposes MTDVocaLiST, a model trained with the authors' multimodal Transformer distillation (MTD) loss.
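As an illustration of the distillation idea, the sketch below matches a student's cross-attention distributions and value activations to a frozen teacher's. The tensor names, temperature, and equal term weighting are assumptions made for this example, not the paper's exact MTD loss.

```python
import torch
import torch.nn.functional as F

def mtd_loss(student_attn, teacher_attn, student_value, teacher_value, tau=1.0):
    """Illustrative multimodal Transformer distillation loss: align the
    student's cross-attention distributions with the teacher's (KL) and
    match value activations (MSE). Term definitions are assumptions."""
    # KL divergence between attention distributions; teacher is the target.
    attn_kl = F.kl_div(
        torch.log_softmax(student_attn / tau, dim=-1),
        torch.softmax(teacher_attn / tau, dim=-1),
        reduction="batchmean",
    )
    # Match intermediate value activations directly.
    value_mse = F.mse_loss(student_value, teacher_value)
    return attn_kl + value_mse
```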
Synchformer: Efficient Synchronization from Sparse Cues
Our objective is audio-visual synchronization with a focus on 'in-the-wild' videos, such as those on YouTube, where synchronization cues can be sparse.
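Synchronization of this kind is commonly cast as predicting the temporal offset between the audio and visual streams. The toy baseline below scores candidate offsets by cosine similarity between per-frame embeddings; it illustrates the task setup only, with placeholder feature shapes, and is not Synchformer's transformer-based model.

```python
import torch

def predict_offset(audio_feats: torch.Tensor,
                   visual_feats: torch.Tensor,
                   max_shift: int) -> int:
    """Score each candidate offset by mean cosine similarity between
    temporally aligned audio and visual features and return the best.
    audio_feats, visual_feats: (T, D) per-frame embeddings."""
    offsets = list(range(-max_shift, max_shift + 1))
    scores = []
    for k in offsets:
        if k >= 0:  # audio lags: drop its first k frames
            a, v = audio_feats[k:], visual_feats[: visual_feats.size(0) - k]
        else:       # audio leads: drop its last |k| frames
            a, v = audio_feats[:k], visual_feats[-k:]
        scores.append(torch.cosine_similarity(a, v, dim=-1).mean())
    return offsets[int(torch.stack(scores).argmax())]
```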
Solos: A Dataset for Audio-Visual Music Analysis
In this paper, we present a new dataset of music performance videos that can be used to train machine learning methods for multiple tasks, such as audio-visual blind source separation and localization, cross-modal correspondences, cross-modal generation and, in general, any audio-visual self-supervised task.
Neural Pitch-Shifting and Time-Stretching with Controllable LPCNet
Pitch-shifting and time-stretching are fundamental audio editing operations with applications in speech manipulation, audio-visual synchronization, and singing voice editing and synthesis.
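For context, classical phase-vocoder-style baselines for both operations are available in librosa; the paper instead performs them with a controllable LPCNet neural vocoder. The file path and parameter values below are placeholders.

```python
import librosa

# Load a speech clip (path is a placeholder).
y, sr = librosa.load("speech.wav", sr=22050)

# Time-stretch: 25% faster playback at the same pitch.
y_fast = librosa.effects.time_stretch(y, rate=1.25)

# Pitch-shift: up two semitones at the same duration.
y_up = librosa.effects.pitch_shift(y, sr=sr, n_steps=2.0)
```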
VocaLiST: An Audio-Visual Synchronisation Model for Lips and Voices
Finally, we use the frozen visual features learned by our lip synchronisation model in the singing voice separation task, outperforming a baseline audio-visual model trained end-to-end.
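In PyTorch terms, "frozen" features simply mean the pretrained encoder is excluded from downstream optimisation. A minimal sketch, with a stand-in module in place of the actual lip-sync visual front-end:

```python
import torch.nn as nn

# Stand-in for the pretrained lip-sync visual encoder; in practice you
# would load the lip synchronisation model's weights here instead.
visual_encoder = nn.Sequential(nn.Conv3d(3, 64, kernel_size=3), nn.ReLU())

# Freeze it so the separation network trains on fixed lip-sync features.
visual_encoder.eval()
for p in visual_encoder.parameters():
    p.requires_grad = False
```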
Target Active Speaker Detection with Audio-visual Cues
To benefit from both facial cues and reference speech, we propose the Target Speaker TalkNet (TS-TalkNet), which leverages a pre-enrolled speaker embedding to complement the audio-visual synchronization cue in detecting whether the target speaker is speaking.
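A minimal sketch of the fusion idea: concatenate per-frame audio-visual features with the pre-enrolled speaker embedding and score each frame as speaking or not. The dimensions and the fusion head are assumptions for illustration, not TS-TalkNet's exact architecture.

```python
import torch
import torch.nn as nn

class TargetSpeakerFusion(nn.Module):
    """Toy fusion head: per-frame audio-visual features are concatenated
    with a fixed target-speaker embedding before classification."""
    def __init__(self, av_dim=256, spk_dim=192):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(av_dim + spk_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, av_feats, spk_emb):
        # av_feats: (T, av_dim) frame-level audio-visual features
        # spk_emb:  (spk_dim,) pre-enrolled target-speaker embedding
        spk = spk_emb.unsqueeze(0).expand(av_feats.size(0), -1)
        return self.classifier(torch.cat([av_feats, spk], dim=-1)).squeeze(-1)
```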
PEAVS: Perceptual Evaluation of Audio-Visual Synchrony Grounded in Viewers' Opinion Scores
Recent advancements in audio-visual generative modeling have been propelled by progress in deep learning and the availability of data-rich benchmarks.
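Perceptual metrics of this kind are typically validated by their rank correlation with human ratings. A minimal sketch with made-up numbers (the scores below are illustrative, not PEAVS outputs):

```python
from scipy.stats import spearmanr

# Hypothetical validation: correlate an automatic synchrony metric's
# scores with human mean opinion scores (MOS) on the same clips.
metric_scores = [0.91, 0.45, 0.78, 0.30, 0.66]
human_mos     = [4.5,  2.1,  3.8,  1.9,  3.2]

rho, p = spearmanr(metric_scores, human_mos)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```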
Explicit Correlation Learning for Generalizable Cross-Modal Deepfake Detection
With the rising prevalence of deepfakes, there is a growing interest in developing generalizable detection methods for various types of deepfakes.