2 code implementations • 29 Jan 2024 • Vladimir Iashin, Weidi Xie, Esa Rahtu, Andrew Zisserman
Our objective is audio-visual synchronization with a focus on 'in-the-wild' videos, such as those on YouTube, where synchronization cues can be sparse.
2 code implementations • 13 Oct 2022 • Vladimir Iashin, Weidi Xie, Esa Rahtu, Andrew Zisserman
This contrasts with the case of synchronising videos of talking heads, where audio-visual correspondence is dense in both time and space.
3 code implementations • 17 Oct 2021 • Vladimir Iashin, Esa Rahtu
In this work, we propose a single model capable of generating visually relevant, high-fidelity sounds, prompted with a set of frames from open-domain videos, on a single GPU and in less time than playback takes.
no code implementations • 27 Jul 2021 • Alessio Xompero, Santiago Donaher, Vladimir Iashin, Francesca Palermo, Gökhan Solak, Claudio Coppola, Reina Ishikawa, Yuichi Nagao, Ryo Hachiuma, Qi Liu, Fan Feng, Chuanlin Lan, Rosa H. M. Chan, Guilherme Christmann, Jyun-Ting Song, Gonuguntla Neeharika, Chinnakotla Krishna Teja Reddy, Dinesh Jain, Bakhtawar Ur Rehman, Andrea Cavallaro
In this paper, we present a range of methods and an open framework to benchmark acoustic and visual perception for the estimation of the capacity of a container, and the type, mass, and amount of its content.
1 code implementation • 2 Dec 2020 • Vladimir Iashin, Francesca Palermo, Gökhan Solak, Claudio Coppola
The CORSMAL 2020 Challenge focuses on the perception part of this problem: the robot needs to estimate the filling mass of a container held by a human.
2 code implementations • 17 May 2020 • Vladimir Iashin, Esa Rahtu
We show the effectiveness of the proposed model with audio and visual modalities on the dense video captioning task, yet the module is capable of digesting any two modalities in a sequence-to-sequence task.
4 code implementations • 17 Mar 2020 • Vladimir Iashin, Esa Rahtu
We apply an automatic speech recognition (ASR) system to obtain a temporally aligned textual description of the speech (similar to subtitles) and treat it as a separate input alongside the video frames and the corresponding audio track.
Ranked #11 on Dense Video Captioning on ActivityNet Captions