no code implementations • 22 Apr 2024 • Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman
Generating Audio Description (AD) for movies is a challenging task that requires fine-grained visual understanding and an awareness of the characters and their names.
no code implementations • 1 Apr 2024 • Joao F. Henriques, Dylan Campbell, Tengda Han
As the horses have long left the barn, our proposal may be seen as antiquated and irrelevant.
no code implementations • 21 Dec 2023 • Zeqian Li, Qirui Chen, Tengda Han, Ya Zhang, Yanfeng Wang, Weidi Xie
In this paper, we consider the problem of temporally aligning videos and texts from instructional videos: specifically, given a long-term video and associated text sentences, our goal is to determine their corresponding timestamps in the video.
no code implementations • 10 Oct 2023 • Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman
Audio Description (AD) is the task of generating descriptions of visual content, at suitable time intervals, for the benefit of visually impaired audiences.
1 code implementation • 17 Jul 2023 • Lukas Knobel, Tengda Han, Yuki M. Asano
While recent supervised methods for reference-based object counting continue to improve performance on benchmark datasets, they must rely on small datasets due to the cost of manually annotating dozens of objects per image.
1 code implementation • 2 Jun 2023 • Niki Amini-Naieni, Kiana Amini-Naieni, Tengda Han, Andrew Zisserman
Our objective is open-world object counting in images, where the target object class is specified by a text description.
Ranked #1 on Zero-Shot Counting on FSC147
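The text-specified counting setup above lends itself to a simple sketch: embed the class description with a vision-language text encoder, correlate it with a spatial image feature map, and decode a density map whose sum is the count. The module below is purely illustrative; the architecture and all names are assumptions, not the paper's actual model:

```python
import torch
import torch.nn as nn

class TextConditionedCounter(nn.Module):
    """Illustrative open-world counter: count the objects named by a text prompt.

    A sketch of the general recipe only (not the paper's architecture):
    correlate a text embedding with a spatial image feature map, decode the
    correlation into a density map, and sum the density to get the count.
    """
    def __init__(self):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 1), nn.ReLU(),  # ReLU keeps the density non-negative
        )

    def forward(self, img_feats, text_emb):
        # img_feats: (B, D, H, W) from a vision backbone
        # text_emb:  (B, D) from the matching text encoder
        corr = torch.einsum("bdhw,bd->bhw", img_feats, text_emb)  # (B, H, W)
        density = self.decoder(corr.unsqueeze(1))                 # (B, 1, H, W)
        return density.sum(dim=(1, 2, 3))                         # per-image counts

model = TextConditionedCounter()
print(model(torch.randn(2, 256, 32, 32), torch.randn(2, 256)))  # two counts
```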
1 code implementation • CVPR 2023 • Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman
The objective of this paper is an automatic Audio Description (AD) model that ingests movies and outputs AD in text form.
no code implementations • ICCV 2023 • Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman
Audio Description (AD) is the task of generating descriptions of visual content, at suitable time intervals, for the benefit of visually impaired audiences.
1 code implementation • 12 Oct 2022 • Jochem Loedeman, Maarten C. Stol, Tengda Han, Yuki M. Asano
With the introduction of the transformer architecture in computer vision, increasing model scale has been demonstrated to be a clear path to achieving performance and robustness gains.
no code implementations • 10 Oct 2022 • Tengda Han, Weidi Xie, Andrew Zisserman
The objective of this paper is an efficient training method for video tasks.
5 code implementations • NeurIPS 2022 • Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, Karen Simonyan
Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research.
Ranked #1 on Action Recognition on RareAct
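The few-shot adaptation described above is typically exercised through an interleaved image-text prompt: a handful of (image, answer) support examples followed by the query. A minimal sketch of building such a prompt, with a hypothetical `Image` placeholder and the actual model call omitted:

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

@dataclass
class Image:
    path: str  # placeholder for real pixel data

Prompt = List[Union[Image, str]]

def build_few_shot_prompt(examples: List[Tuple[Image, str]], query: Image) -> Prompt:
    """Interleave (image, answer) support examples, then append the query.

    A few-shot visual language model conditions on such a prompt and
    completes the final answer with no gradient updates; the model call
    itself is omitted here.
    """
    prompt: Prompt = []
    for img, qa in examples:
        prompt += [img, qa]
    prompt += [query, "Question: what is shown? Answer:"]
    return prompt

examples = [
    (Image("flamingo.jpg"), "Question: what is shown? Answer: a flamingo."),
    (Image("pangolin.jpg"), "Question: what is shown? Answer: a pangolin."),
]
print(build_few_shot_prompt(examples, Image("query.jpg")))
```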
1 code implementation • CVPR 2022 • Tengda Han, Weidi Xie, Andrew Zisserman
The objective of this paper is a temporal alignment network that ingests long-term video sequences and associated text sentences, in order to: (1) determine if a sentence is alignable with the video; and (2) if it is alignable, then determine its alignment.
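A minimal sketch of the two decisions just described, assuming pre-computed, L2-normalised per-segment video features and sentence embeddings; the threshold value and nearest-neighbour rule are illustrative, not the paper's actual network:

```python
import numpy as np

def align_sentences(video_feats, sent_feats, align_thresh=0.5):
    """Per sentence: (1) decide if it is alignable at all, by thresholding
    its best cosine similarity against any video segment; (2) if alignable,
    return the index of the best-matching segment.

    video_feats: (T, D) and sent_feats: (S, D), both L2-normalised.
    """
    sim = sent_feats @ video_feats.T   # (S, T) similarity matrix
    best = sim.argmax(axis=1)          # best segment per sentence
    conf = sim.max(axis=1)             # confidence of that match
    return [int(b) if c >= align_thresh else None for b, c in zip(best, conf)]

rng = np.random.default_rng(0)
v = rng.normal(size=(50, 64)); v /= np.linalg.norm(v, axis=1, keepdims=True)
s = rng.normal(size=(4, 64));  s /= np.linalg.norm(s, axis=1, keepdims=True)
print(align_sentences(v, s))  # None marks a sentence judged non-alignable
```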
1 code implementation • 8 Dec 2021 • Chen Ju, Tengda Han, Kunhao Zheng, Ya Zhang, Weidi Xie
Image-based visual-language (I-VL) pre-training has shown great success for learning joint visual-textual representations from large-scale web data, revealing a remarkable ability for zero-shot generalisation.
Ranked #5 on Zero-Shot Action Detection on ActivityNet-1.3
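Zero-shot generalisation with a joint visual-textual embedding space usually reduces to nearest-neighbour matching between an image (or clip) embedding and embedded class-name prompts. A sketch with a toy stand-in encoder; the prompt template and encoder are assumptions, not the paper's method:

```python
import numpy as np

def zero_shot_classify(clip_emb, class_names, encode_text):
    """Zero-shot recognition with a joint visual-textual embedding space:
    embed each class name as a text prompt and return the nearest one."""
    prompts = [f"a video of a person {c}" for c in class_names]
    text_embs = np.stack([encode_text(p) for p in prompts])  # (C, D)
    return class_names[int((text_embs @ clip_emb).argmax())]

def toy_encoder(s, dim=32):
    # Per-string toy vectors standing in for a real text encoder.
    rng = np.random.default_rng(abs(hash(s)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

print(zero_shot_classify(toy_encoder("a clip of diving"),
                         ["diving", "surfing", "archery"], toy_encoder))
```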
1 code implementation • NeurIPS 2020 • Tengda Han, Weidi Xie, Andrew Zisserman
The objective of this paper is visual-only self-supervised video representation learning.
Ranked #12 on Self-Supervised Action Recognition on HMDB51 (finetuned)
1 code implementation • ECCV 2020 • Tengda Han, Weidi Xie, Andrew Zisserman
The objective of this paper is self-supervised learning from video, in particular for representations for action recognition.
1 code implementation • 10 Sep 2019 • Tengda Han, Weidi Xie, Andrew Zisserman
The objective of this paper is self-supervised learning of spatio-temporal embeddings from video, suitable for human action recognition.
Ranked #33 on Self-Supervised Action Recognition on UCF101
Tasks: Representation Learning, Self-Supervised Action Recognition (+2 more)
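Self-supervised video representation methods of this family commonly optimise a contrastive objective. Below is a generic InfoNCE loss as a sketch; it is a standard formulation, not necessarily the exact loss used in the paper above:

```python
import torch
import torch.nn.functional as F

def info_nce(pred, target, temperature=0.07):
    """Generic InfoNCE loss: each predicted embedding should match its own
    target (e.g. the representation of a future clip) and not the targets
    of other samples in the batch. Positives sit on the diagonal."""
    pred = F.normalize(pred, dim=1)
    target = F.normalize(target, dim=1)
    logits = pred @ target.t() / temperature   # (B, B) similarity matrix
    labels = torch.arange(pred.size(0))        # i-th pred matches i-th target
    return F.cross_entropy(logits, labels)

print(info_nce(torch.randn(8, 128), torch.randn(8, 128)).item())
```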
no code implementations • 19 Sep 2017 • Tengda Han, Jue Wang, Anoop Cherian, Stephen Gould
For effective human-robot interaction, it is important that a robotic assistant can forecast the next action a human will consider in a given task.
no code implementations • 24 Jul 2017 • Sam Toyer, Anoop Cherian, Tengda Han, Stephen Gould
Human pose forecasting is an important problem in computer vision with applications to human-robot interaction, visual surveillance, and autonomous driving.