Zero-Shot Action Recognition
34 papers with code • 7 benchmarks • 6 datasets
Benchmarks
These leaderboards are used to track progress in Zero-Shot Action Recognition
Libraries
Use these libraries to find Zero-Shot Action Recognition models and implementations.
Most implemented papers
Tell me what you see: A zero-shot action recognition method based on natural language descriptions
To the best of our knowledge, this is the first work to represent both videos and labels with descriptive sentences.
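A minimal sketch of the idea in this excerpt: if both the video and each candidate label are represented as descriptive sentences, zero-shot classification reduces to matching sentence embeddings. The encoder below (sentence-transformers with `all-MiniLM-L6-v2`) and the hand-written descriptions are stand-ins, not the paper's actual models or description sources.

```python
# Match a video's description against label descriptions in embedding space.
# Model name and descriptions are illustrative stand-ins, not the paper's.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

video_description = "A person repeatedly strikes a ball over a net with a racket."
label_descriptions = {
    "playing tennis": "Two players hit a ball across a net using rackets.",
    "swimming": "A person moves through water using arm and leg strokes.",
    "archery": "A person draws a bow and releases an arrow at a target.",
}

video_emb = encoder.encode(video_description, convert_to_tensor=True)
label_embs = encoder.encode(list(label_descriptions.values()), convert_to_tensor=True)

# Cosine similarity between the video sentence and each label sentence;
# the most similar label is the zero-shot prediction.
scores = util.cos_sim(video_emb, label_embs)[0]
prediction = list(label_descriptions)[int(scores.argmax())]
print(prediction)
```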
End-to-End Semantic Video Transformer for Zero-Shot Action Recognition
While video action recognition has been an active area of research for several years, zero-shot action recognition has only recently started gaining traction.
Rethinking Zero-shot Action Recognition: Learning from Latent Atomic Actions
However, due to the complexity of actions, it remains challenging to transfer knowledge learned from source to target action domains.
Alignment-Uniformity aware Representation Learning for Zero-shot Video Classification
Further, we propose a class generator that synthesizes features of unseen classes by interpolating and extrapolating the features of seen classes.
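The interpolate-and-extrapolate step can be sketched in a few lines. This is not the paper's generator; it only illustrates the geometric operation on per-class feature prototypes, with random vectors standing in for real features.

```python
# Synthesize unseen-class features from two seen-class prototypes:
# alpha in (0, 1) interpolates between them; alpha outside [0, 1] extrapolates.
import torch

def synthesize(proto_a: torch.Tensor, proto_b: torch.Tensor, alpha: float) -> torch.Tensor:
    return (1.0 - alpha) * proto_a + alpha * proto_b

seen_feats = {"run": torch.randn(512), "jump": torch.randn(512)}  # stand-in prototypes
interpolated = synthesize(seen_feats["run"], seen_feats["jump"], alpha=0.5)   # midpoint
extrapolated = synthesize(seen_feats["run"], seen_feats["jump"], alpha=1.3)   # beyond "jump"
```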
MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval
Dominant pre-training work for video-text retrieval mainly adopts "dual-encoder" architectures to enable efficient retrieval, where two separate encoders are used to contrast global video and text representations but ignore detailed local semantics.
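The "dual-encoder" contrastive setup the excerpt criticizes can be summarized as follows: each encoder produces one global embedding, and a symmetric InfoNCE loss pulls matched video-text pairs together. Encoder internals are placeholders here; only the loss structure is shown.

```python
# Symmetric contrastive (InfoNCE) loss over global video/text embeddings.
import torch
import torch.nn.functional as F

def contrastive_loss(video_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature            # (B, B) similarity matrix
    targets = torch.arange(len(v))            # matched pairs sit on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

video_emb = torch.randn(8, 256)  # stand-in for a video encoder's output
text_emb = torch.randn(8, 256)   # stand-in for a text encoder's output
loss = contrastive_loss(video_emb, text_emb)
```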
A CLIP-Hitchhiker's Guide to Long Video Retrieval
Our goal in this paper is the adaptation of image-text models for long video retrieval.
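One simple adaptation explored in this line of work is pooling per-frame CLIP image embeddings into a single video embedding, weighting frames by their similarity to the text query. The sketch below assumes the frame and query embeddings are already computed; shapes, the temperature, and the pooling choice are illustrative, not the paper's definitive recipe.

```python
# Query-weighted mean pooling of frame embeddings for long-video retrieval.
import torch
import torch.nn.functional as F

def query_weighted_pool(frame_embs: torch.Tensor, query_emb: torch.Tensor,
                        temperature: float = 0.1) -> torch.Tensor:
    # frame_embs: (num_frames, dim) image embeddings for sampled frames
    # query_emb:  (dim,) text embedding of the retrieval query
    frames = F.normalize(frame_embs, dim=-1)
    query = F.normalize(query_emb, dim=-1)
    weights = torch.softmax(frames @ query / temperature, dim=0)  # (num_frames,)
    return (weights.unsqueeze(-1) * frame_embs).sum(dim=0)

video_emb = query_weighted_pool(torch.randn(64, 512), torch.randn(512))
```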
Global Semantic Descriptors for Zero-Shot Action Recognition
This work introduces a new ZSAR method based on the relationships of actions-objects and actions-descriptive sentences.
MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge
We adapt a VL model for zero-shot and few-shot action recognition using a collection of unlabeled videos and an unpaired action dictionary.
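A hedged sketch of the matching step this excerpt implies: score each unlabeled video against the action dictionary with a pretrained vision-language model and keep the best match as a pseudo-label for finetuning. `encode_video` and `encode_text` are placeholder encoders; the paper's full pipeline (expansion with language knowledge, the finetuning objective) is not reproduced here.

```python
# Pseudo-label unlabeled videos against an unpaired action dictionary.
import torch
import torch.nn.functional as F

def encode_video(video) -> torch.Tensor:      # placeholder VL video encoder
    return torch.randn(512)

def encode_text(text: str) -> torch.Tensor:   # placeholder VL text encoder
    return torch.randn(512)

action_dictionary = ["archery", "juggling", "rock climbing"]
text_embs = F.normalize(torch.stack([encode_text(a) for a in action_dictionary]), dim=-1)

def pseudo_label(video) -> str:
    v = F.normalize(encode_video(video), dim=-1)
    return action_dictionary[int((text_embs @ v).argmax())]
```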
Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting
Through this prompting scheme, we can achieve state-of-the-art zero-shot performance on Kinetics-600, HMDB51 and UCF101 while remaining competitive in the supervised setting.
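Multimodal prompting in the spirit of this excerpt (not Vita-CLIP's exact scheme) prepends small sets of learnable prompt vectors to the video-token and text-token sequences, so that only the prompts are trained while the backbone stays frozen. The module below shows just that prepending step.

```python
# Learnable prompt vectors prepended to a token sequence; backbone not shown.
import torch
import torch.nn as nn

class PromptedTokens(nn.Module):
    def __init__(self, num_prompts: int, dim: int):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim) -> (batch, num_prompts + seq_len, dim)
        batch = tokens.shape[0]
        return torch.cat([self.prompts.expand(batch, -1, -1), tokens], dim=1)

video_prompting = PromptedTokens(num_prompts=8, dim=768)   # sizes are illustrative
prompted_video = video_prompting(torch.randn(2, 196, 768))
```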
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
Specifically, we utilize a multi-scale approach to generate video-related descriptions.