Video Recognition
147 papers with code • 0 benchmarks • 10 datasets
Video Recognition is the process of obtaining, processing, and analysing data received from a visual source, specifically video.
Benchmarks
These leaderboards are used to track progress in Video Recognition
Libraries
Use these libraries to find Video Recognition models and implementations
Datasets
Latest papers
What Can Simple Arithmetic Operations Do for Temporal Modeling?
We conduct comprehensive ablation studies on the instantiation of ATMs and demonstrate that this module provides powerful temporal modeling capability at a low computational cost.
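The snippet above refers to temporal modeling built from simple arithmetic operations (ATMs). A minimal sketch of the general idea, assuming frame-wise subtraction and addition as cheap temporal cues; this is an illustration, not the paper's actual ATM design:

```python
import numpy as np

def arithmetic_temporal_module(features):
    """Cheap temporal modeling via frame-wise arithmetic.

    features: array of shape (T, C) -- one feature vector per frame.
    Returns per-frame features augmented with a temporal-difference
    signal (motion-like cue) and a temporal-sum signal (aggregation
    cue). Hypothetical stand-in for an ATM-style module.
    """
    diff = np.diff(features, axis=0, prepend=features[:1])  # frame deltas
    summ = features + np.roll(features, shift=1, axis=0)    # neighbor sums
    return np.concatenate([features, diff, summ], axis=1)

feats = np.random.rand(8, 16)  # 8 frames, 16-dim features
out = arithmetic_temporal_module(feats)
print(out.shape)  # (8, 48)
```

No attention or convolution is involved, which is what keeps the computational cost low.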
Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition
Video transformer designs are based on self-attention that can model global context at a high computational cost.
Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles
Modern hierarchical vision transformers have added several vision-specific components in the pursuit of supervised classification performance.
Implicit Temporal Modeling with Learnable Alignment for Video Recognition
While modeling temporal information within a straight-through tube is widely adopted in the literature, we find that simple frame alignment already captures the essential information without temporal attention.
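To make the frame-alignment idea concrete, here is a minimal sketch that reweights each frame's features by cosine similarity to a reference frame. The similarity-based weighting is an assumption for illustration, not the paper's learnable alignment:

```python
import numpy as np

def align_frames(features, ref_idx=0):
    """Align per-frame features to a reference frame.

    features: (T, C). Each frame is scaled by its cosine similarity
    to the reference frame -- a hand-rolled stand-in for learnable
    alignment, requiring no temporal attention.
    """
    ref = features[ref_idx]
    norms = np.linalg.norm(features, axis=1) * np.linalg.norm(ref)
    sims = features @ ref / np.maximum(norms, 1e-8)  # (T,) similarities
    return features * sims[:, None]

feats = np.random.rand(4, 8)  # 4 frames, 8-dim features
aligned = align_frames(feats)
print(aligned.shape)  # (4, 8)
```

The reference frame is left unchanged (its similarity to itself is 1), while dissimilar frames are attenuated.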
Use Your Head: Improving Long-Tail Video Recognition
We demonstrate that, unlike naturally-collected video datasets and existing long-tail image benchmarks, current video benchmarks fall short on multiple long-tailed properties.
Frame Flexible Network
To fix this issue, we propose a general framework, named Frame Flexible Network (FFN), which not only enables the model to be evaluated at different frames to adjust its computation, but also reduces the memory costs of storing multiple models significantly.
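Evaluating one model at different frame counts is the key property claimed above. A minimal sketch of why a single set of weights can handle any number of frames, assuming mean-pooling over time (an illustrative mechanism, not FFN itself):

```python
import numpy as np

def classify(frames, w):
    """Frame-count-agnostic video classifier sketch.

    frames: (T, C) per-frame features for any T; w: (C, K) classifier
    weights. Mean-pooling over time collapses the temporal dimension,
    so one weight matrix serves all frame counts.
    """
    pooled = frames.mean(axis=0)  # (C,) temporal average
    return pooled @ w             # (K,) class logits

w = np.random.rand(16, 5)  # shared weights: 16-dim features, 5 classes
print(classify(np.random.rand(4, 16), w).shape)   # 4 frames -> (5,)
print(classify(np.random.rand(16, 16), w).shape)  # 16 frames, same weights
```

Adjusting the number of input frames then trades accuracy for computation without storing multiple models.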
The effectiveness of MAE pre-pretraining for billion-scale pretraining
While MAE has only been shown to scale with the size of models, we find that it scales with the size of the training dataset as well.
Making Vision Transformers Efficient from A Token Sparsification View
In this work, we propose a novel Semantic Token ViT (STViT), for efficient global and local vision transformers, which can also be revised to serve as backbone for downstream tasks.
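Token sparsification, as referenced above, reduces the number of tokens a vision transformer must process. A minimal sketch that keeps the top-k tokens under a hypothetical importance score; STViT's actual semantic-token generation is more involved:

```python
import numpy as np

def sparsify_tokens(tokens, scores, k):
    """Keep the k highest-scoring tokens.

    tokens: (N, C) patch tokens; scores: (N,) importance scores
    (e.g. attention-derived -- a hypothetical choice here).
    Original token order is preserved among the survivors.
    """
    keep = np.argsort(scores)[-k:]
    return tokens[np.sort(keep)]

tokens = np.random.rand(196, 64)  # 14x14 patch tokens, 64-dim
scores = np.random.rand(196)
reduced = sparsify_tokens(tokens, scores, k=16)
print(reduced.shape)  # (16, 64)
```

Downstream attention then runs over 16 tokens instead of 196, which is where the efficiency gain comes from.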
MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge
We adapt a VL model for zero-shot and few-shot action recognition using a collection of unlabeled videos and an unpaired action dictionary.
Maximizing Spatio-Temporal Entropy of Deep 3D CNNs for Efficient Video Recognition
In this work, we propose to automatically design efficient 3D CNN architectures via a novel training-free neural architecture search approach tailored for 3D CNNs that accounts for model complexity.