Zero-Shot Action Recognition

34 papers with code • 7 benchmarks • 6 datasets

Zero-shot action recognition (ZSAR) aims to classify videos into action categories that were never seen during training, typically by transferring knowledge from seen to unseen classes through a shared semantic space such as attributes, word embeddings, or descriptive sentences.

Most implemented papers

Tell me what you see: A zero-shot action recognition method based on natural language descriptions

valterlej/zsarcap 18 Dec 2021

To the best of our knowledge, this is the first work to represent both videos and labels with descriptive sentences.
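
A minimal sketch of the sentence-matching idea this describes: embed descriptive sentences for both the video (e.g., captions from a captioning model) and the candidate action labels with a shared sentence encoder, then pick the label whose description is most similar. The encoder choice (sentence-transformers' all-MiniLM-L6-v2) and the example texts are assumptions for illustration, not the paper's exact setup.

```python
# Sketch: zero-shot matching of video captions to label descriptions in a
# shared sentence-embedding space (encoder and texts are illustrative).
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Descriptive sentences for the (unseen) action labels.
label_descriptions = {
    "archery": "a person draws a bow and shoots an arrow at a target",
    "playing piano": "a person presses the keys of a piano to play music",
}

# Captions describing the query video (e.g., produced by a captioning model).
video_captions = ["someone aims a bow and releases an arrow"]

label_names = list(label_descriptions)
label_emb = encoder.encode([label_descriptions[n] for n in label_names],
                           normalize_embeddings=True)
video_emb = encoder.encode(video_captions, normalize_embeddings=True).mean(axis=0)

scores = label_emb @ video_emb               # cosine similarity (embeddings are unit norm)
print(label_names[int(np.argmax(scores))])   # predicted action class
```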

End-to-End Semantic Video Transformer for Zero-Shot Action Recognition

secure-and-intelligent-systems-lab/semanticvideotransformer 10 Mar 2022

While video action recognition has been an active area of research for several years, zero-shot action recognition has only recently started gaining traction.

Rethinking Zero-shot Action Recognition: Learning from Latent Atomic Actions

KevinQian97/JigsawNet ECCV 2022

However, due to the complexity of actions, it remains challenging to transfer knowledge learned from source to target action domains.

Alignment-Uniformity aware Representation Learning for Zero-shot Video Classification

ShipuLoveMili/CVPR2022-AURL CVPR 2022

Further, we synthesize features of unseen classes by proposing a class generator that interpolates and extrapolates the features of seen classes.
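
A rough sketch of what "interpolates and extrapolates the features of seen classes" can look like: combine seen-class prototype features weighted by semantic similarity to the unseen class, then push the result away from the nearest seen prototypes. The similarity weighting, the extrapolation step, and all dimensions below are illustrative assumptions, not the class generator proposed in the paper.

```python
# Sketch: synthesizing a pseudo-feature for an unseen class from seen-class
# prototypes (weights and extrapolation strength are illustrative).
import numpy as np

rng = np.random.default_rng(0)

seen_feats = rng.normal(size=(5, 512))     # visual prototype per seen class
seen_sem   = rng.normal(size=(5, 300))     # semantic embedding per seen class
unseen_sem = rng.normal(size=(300,))       # semantic embedding of the unseen class

def normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Interpolation: convex combination of seen prototypes, weighted by semantic similarity.
sim = normalize(seen_sem) @ normalize(unseen_sem)
w = np.exp(sim / 0.1)
w /= w.sum()
interp = w @ seen_feats

# Extrapolation: push the synthesized feature away from the two closest seen classes.
nearest = seen_feats[np.argsort(sim)[-2:]].mean(axis=0)
alpha = 0.3                                 # extrapolation strength (hyperparameter)
synthetic_unseen_feat = interp + alpha * (interp - nearest)
print(synthetic_unseen_feat.shape)          # (512,) pseudo-feature for the unseen class
```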

MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval

tencentarc/mcq 26 Apr 2022

Dominant pre-training work for video-text retrieval mainly adopts "dual-encoder" architectures to enable efficient retrieval, where two separate encoders contrast global video and text representations but ignore detailed local semantics.
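
A minimal sketch of the dual-encoder setup described above: two separate encoders produce one global embedding per video and per caption, trained with a symmetric contrastive (InfoNCE-style) loss. The linear "encoders", dimensions, and temperature are placeholders, not the architecture used by MILES.

```python
# Sketch: dual-encoder contrastive training on global video/text embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    def __init__(self, video_dim=1024, text_dim=768, embed_dim=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, embed_dim)      # stand-in for a video encoder
        self.text_proj = nn.Linear(text_dim, embed_dim)        # stand-in for a text encoder
        self.logit_scale = nn.Parameter(torch.tensor(2.659))   # learnable temperature

    def forward(self, video_feats, text_feats):
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return self.logit_scale.exp() * v @ t.T                # (batch, batch) similarities

model = DualEncoder()
video_feats = torch.randn(8, 1024)   # pooled global video features
text_feats = torch.randn(8, 768)     # pooled global caption features

logits = model(video_feats, text_feats)
targets = torch.arange(8)            # matching video-text pairs lie on the diagonal
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
loss.backward()
```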

A CLIP-Hitchhiker's Guide to Long Video Retrieval

m-bain/clip-hitchhiker 17 May 2022

Our goal in this paper is the adaptation of image-text models for long video retrieval.

Global Semantic Descriptors for Zero-Shot Action Recognition

valterlej/objsentzsar 24 Sep 2022

This work introduces a new ZSAR method based on the relationships between actions and objects and between actions and descriptive sentences.

MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge

wlin-at/maxi ICCV 2023

We adapt a VL model for zero-shot and few-shot action recognition using a collection of unlabeled videos and an unpaired action dictionary.
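
A rough sketch of the unsupervised adaptation idea this describes: score each unlabeled video against the unpaired action dictionary with a frozen vision-language model, take the best match as a pseudo-label, and fine-tune on those matches. The linear encoders, temperature, and training loop below are placeholder assumptions; MAXI builds on a real CLIP backbone and a richer text-expansion step.

```python
# Sketch: pseudo-labeling unlabeled videos against an action dictionary,
# then fine-tuning the visual side on the resulting pseudo-pairs.
import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim = 256
video_encoder = nn.Linear(1024, embed_dim)   # stand-in for the visual tower
text_encoder = nn.Linear(768, embed_dim)     # stand-in for the text tower

action_dictionary = ["archery", "juggling", "push ups"]   # unpaired action names
dict_feats = torch.randn(len(action_dictionary), 768)     # stand-in text features

unlabeled_videos = torch.randn(16, 1024)                  # pooled video features

with torch.no_grad():
    v = F.normalize(video_encoder(unlabeled_videos), dim=-1)
    t = F.normalize(text_encoder(dict_feats), dim=-1)
    pseudo_labels = (v @ t.T).argmax(dim=-1)  # best-matching dictionary entry per video

# Fine-tune the visual side on the pseudo-pairs with a standard classification loss.
optimizer = torch.optim.AdamW(video_encoder.parameters(), lr=1e-4)
logits = F.normalize(video_encoder(unlabeled_videos), dim=-1) @ t.T
loss = F.cross_entropy(logits / 0.07, pseudo_labels)
loss.backward()
optimizer.step()
```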

Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting

talalwasim/vita-clip CVPR 2023

Through this prompting scheme, we can achieve state-of-the-art zero-shot performance on Kinetics-600, HMDB51 and UCF101 while remaining competitive in the supervised setting.

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

opengvlab/internvideo 13 Jul 2023

Specifically, we utilize a multi-scale approach to generate video-related descriptions.