Video Classification
178 papers with code • 11 benchmarks • 17 datasets
Video Classification is the task of producing a label that is relevant to a video given its frames. A good video-level classifier not only provides accurate frame labels, but also best describes the entire video given the features and annotations of the various frames. For example, a video might contain a tree in some frame, but the label that is central to the video might be something else (e.g., "hiking"). The granularity of the labels needed to describe the frames and the video depends on the task. Typical tasks include assigning one or more global labels to the video, and assigning one or more labels to each frame inside the video.
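As a minimal illustration of the frame-vs-video distinction above, one simple baseline is late fusion: average per-frame class probabilities over time and take the argmax as the video-level label. The function and class names below are hypothetical, and real systems use far richer temporal models; this is only a sketch of the aggregation idea.

```python
import numpy as np

def video_label(frame_probs, class_names):
    """Aggregate per-frame class probabilities into a single video-level
    label by averaging over frames (simple late fusion)."""
    avg = np.asarray(frame_probs).mean(axis=0)  # mean probability per class
    return class_names[int(np.argmax(avg))]

# Toy example: 3 frames scored over 2 classes ("tree", "hiking").
# One frame is dominated by a tree, but the video as a whole is hiking.
frames = [
    [0.6, 0.4],
    [0.2, 0.8],
    [0.1, 0.9],
]
print(video_label(frames, ["tree", "hiking"]))  # -> hiking
```

Averaging is order-invariant, which is exactly why stronger models (recurrent, attention-based, or 3D-convolutional) are used when short- and long-term temporal structure matters.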
Libraries
Use these libraries to find Video Classification models and implementations
Datasets
Subtasks
Most implemented papers
TNT: Text-Conditioned Network with Transductive Inference for Few-Shot Video Classification
Recently, few-shot learning has received increasing interest.
MorphMLP: An Efficient MLP-Like Backbone for Spatial-Temporal Representation Learning
With such multi-dimension and multi-scale factorization, our MorphMLP block can achieve a great accuracy-computation balance.
A Dataset for Medical Instructional Video Classification and Question Answering
This paper introduces a new challenge and datasets to foster research toward designing systems that can understand medical videos and provide visual answers to natural language questions.
Temporal and cross-modal attention for audio-visual zero-shot learning
We show that our proposed framework that ingests temporal features yields state-of-the-art performance on the UCF, VGGSound, and ActivityNet benchmarks for (generalised) zero-shot learning.
TAD: A Large-Scale Benchmark for Traffic Accidents Detection from Video Surveillance
After integration and annotation by various dimensions, a large-scale traffic accidents dataset named TAD is proposed in this work.
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
However, the progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs.
Large-Scale Video Classification with Convolutional Neural Networks
We further study the generalization performance of our best model by retraining the top layers on the UCF-101 Action Recognition dataset and observe significant performance improvements compared to the UCF-101 baseline model (63.3%, up from 43.9%).
Beyond Short Snippets: Deep Networks for Video Classification
Convolutional neural networks (CNNs) have been extensively applied for image recognition problems giving state-of-the-art results on recognition, detection, segmentation and retrieval.
Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification
In this paper, we propose a hybrid deep learning framework for video classification, which is able to model static spatial information, short-term motion, as well as long-term temporal clues in the videos.
ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding
In spite of many dataset efforts for human action recognition, current computer vision algorithms are still severely limited in terms of the variability and complexity of the actions that they can recognize.