Video Recognition
145 papers with code • 0 benchmarks • 10 datasets
Video Recognition is the process of obtaining, processing, and analysing data received from a visual source, specifically video.
Benchmarks
These leaderboards are used to track progress in Video Recognition
Libraries
Use these libraries to find Video Recognition models and implementations
Datasets
Latest papers
ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video
In this paper, our goal is to present a zero-cost adaptation paradigm (ZeroI2V) to transfer image transformers to video recognition tasks (i.e., introduce zero extra cost to the adapted models during inference).
Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning
When pre-training on the large-scale Kinetics-710, we achieve 89.7% on Kinetics-400 with a frozen ViT-L model, which verifies the scalability of DiST.
Eventful Transformers: Leveraging Temporal Redundancy in Vision Transformers
In this work, we exploit temporal redundancy between subsequent inputs to reduce the cost of Transformers for video processing.
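The core idea of exploiting temporal redundancy can be sketched generically: cache per-token outputs and recompute an expensive operation only for tokens whose input changed between frames. This is an illustrative sketch, not the paper's implementation; `update_tokens`, `expensive_fn`, and the scalar-token simplification are all assumptions for the example.

```python
# Illustrative sketch (not the Eventful Transformers code): reuse cached
# per-token outputs when a token's input barely changes between frames.
def update_tokens(prev_inputs, new_inputs, cached_outputs, expensive_fn, tau=0.1):
    """Recompute expensive_fn only for tokens that changed by more than tau."""
    outputs = []
    recomputed = 0
    for prev, new, cached in zip(prev_inputs, new_inputs, cached_outputs):
        if abs(new - prev) > tau:      # token changed enough: recompute
            outputs.append(expensive_fn(new))
            recomputed += 1
        else:                          # token is temporally redundant: reuse cache
            outputs.append(cached)
    return outputs, recomputed
```

On static video regions most tokens hit the cache branch, so the cost of the expensive operation scales with how much the scene changes rather than with the full token count.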
Learning from Semantic Alignment between Unpaired Multiviews for Egocentric Video Recognition
To facilitate the data efficiency of multiview learning, we further perform video-text alignment for first-person and third-person videos, to fully leverage the semantic knowledge to improve video representations.
Audio-Visual Class-Incremental Learning
We demonstrate that joint audio-visual modeling can improve class-incremental learning, but current methods fail to preserve semantic similarity between audio and visual features as the number of incremental steps grows.
Helping Hands: An Object-Aware Ego-Centric Video Recognition Model
We demonstrate the performance of the object-aware representations learnt by our model, by: (i) evaluating it for strong transfer, i.e. through zero-shot testing, on a number of downstream video-text retrieval and classification benchmarks; and (ii) by using the representations learned as input for long-term video understanding tasks (e.g. Episodic Memory in Ego4D).
Orthogonal Temporal Interpolation for Zero-Shot Video Recognition
We propose a model, OTI, for zero-shot video recognition (ZSVR) that employs orthogonal temporal interpolation and a matching loss based on VLMs.
Prune Spatio-temporal Tokens by Semantic-aware Temporal Accumulation
Based on the STA score, we are able to progressively prune the tokens without introducing any additional parameters or requiring further re-training.
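Score-based progressive pruning of this kind can be sketched generically: at each stage, keep only the top-scoring fraction of tokens, with no learned parameters involved. A minimal sketch under that assumption; the function name, fixed keep ratio, and scalar scores are illustrative, not the paper's STA formulation.

```python
# Illustrative sketch (not the paper's code): progressively drop the
# lowest-scoring tokens, keeping a fixed fraction at each stage.
# Purely parameter-free: only a sort on precomputed scores, no retraining.
def progressive_prune(tokens, scores, keep_ratio=0.5, stages=2):
    """Keep the top-scoring keep_ratio of tokens at each of `stages` stages."""
    for _ in range(stages):
        k = max(1, int(len(tokens) * keep_ratio))
        # indices of the k highest-scoring tokens, original order preserved
        top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
        kept = sorted(top)
        tokens = [tokens[i] for i in kept]
        scores = [scores[i] for i in kept]
    return tokens
```

After two stages at a 0.5 keep ratio, only a quarter of the original tokens remain, which is where the inference savings come from.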
What Can Simple Arithmetic Operations Do for Temporal Modeling?
We conduct comprehensive ablation studies on the instantiation of ATMs and demonstrate that this module provides powerful temporal modeling capability at a low computational cost.
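One of the simplest arithmetic operations for temporal modeling is subtraction between adjacent frame features, which yields motion-sensitive features at negligible cost. The sketch below illustrates that general idea with scalar per-frame features; the function name and zero-padding choice are assumptions, not the paper's ATM module.

```python
# Illustrative sketch: temporal differencing (a "simple arithmetic
# operation") turns per-frame features into motion-sensitive features.
def temporal_difference(frame_feats):
    """Subtract adjacent frame features; zero-pad so the length is preserved."""
    diffs = [b - a for a, b in zip(frame_feats, frame_feats[1:])]
    return [0.0] + diffs  # first frame has no predecessor: zero motion
```

Static content cancels out and only change between frames survives, which is why such operations can provide temporal modeling capability without adding learned parameters.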
Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition
Video transformer designs are based on self-attention that can model global context at a high computational cost.