Video Classification
172 papers with code • 11 benchmarks • 17 datasets
Video Classification is the task of producing a label that is relevant to a video given its frames. A good video-level classifier not only provides accurate frame labels, but also best describes the entire video given the features and annotations of the individual frames. For example, a video might contain a tree in some frame, but the label that is central to the video might be something else (e.g., "hiking"). The granularity of the labels needed to describe the frames and the video depends on the task. Typical tasks include assigning one or more global labels to the video, and assigning one or more labels to each frame inside the video.
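The distinction above between frame labels and a video-level label can be illustrated with a minimal sketch: per-frame class scores are pooled over time so that a label weakly present in many frames (e.g., "hiking") can outrank one that dominates a single frame (e.g., "tree"). The function name, labels, and logit values below are illustrative, and mean pooling is only one of many aggregation strategies.

```python
import numpy as np

def classify_video(frame_logits: np.ndarray, labels: list) -> str:
    """Aggregate per-frame class logits into one video-level label.

    frame_logits: array of shape (num_frames, num_classes).
    Mean pooling over the time axis favors labels that are
    consistently present across frames.
    """
    video_logits = frame_logits.mean(axis=0)  # pool over frames
    return labels[int(video_logits.argmax())]

labels = ["tree", "hiking"]
# "tree" spikes in one close-up frame, but "hiking" is
# consistently present throughout the clip.
frame_logits = np.array([
    [5.0, 1.0],   # close-up of a tree
    [0.5, 2.0],   # trail footage
    [0.2, 2.5],
    [0.1, 2.2],
])
print(classify_video(frame_logits, labels))  # -> hiking
```

A per-frame argmax on the first row would return "tree"; pooling first recovers the label that describes the whole video.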
Latest papers with no code
Efficient Selective Audio Masked Multimodal Bottleneck Transformer for Audio-Video Classification
To learn from multimodal videos effectively, we propose a novel audio-video recognition approach, the Audio-Video Transformer (AVT), which leverages the effective spatio-temporal representations of the video Transformer to improve action recognition accuracy.
Neural architecture impact on identifying temporally extended Reinforcement Learning tasks
In addition, motivated by recent developments in attention-based video-classification models using the Vision Transformer, we propose a Vision Transformer-based architecture for the image-based RL domain as well.
Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding
While most modern video understanding models operate on short-range clips, real-world videos are often several minutes long with semantically consistent segments of variable length.
Language as the Medium: Multimodal Video Classification through text only
Despite an exciting new wave of multimodal machine learning models, current approaches still struggle to interpret the complex contextual relationships between the different modalities present in videos.
AV-MaskEnhancer: Enhancing Video Representations through Audio-Visual Masked Autoencoder
Learning high-quality video representations has significant applications in computer vision but remains challenging.
The Staged Knowledge Distillation in Video Classification: Harmonizing Student Progress by a Complementary Weakly Supervised Framework
Our proposed substage-based distillation approach has the potential to inform future research on label-efficient learning for video data.
Active Learning for Video Classification with Frame Level Queries
To the best of our knowledge, this is the first research effort to develop an active learning framework for video classification in which the annotators need to inspect only a few frames to produce a label, rather than watching the entire video end-to-end.
Boosting Breast Ultrasound Video Classification by the Guidance of Keyframe Feature Centers
The coherence loss uses the feature centers generated by the static images to guide the frame attention in the video model.
Multi-label Video Classification for Underwater Ship Inspection
Today, ship hull inspection, including the examination of the external coating and the detection of defects and other types of external degradation such as corrosion and marine growth, is conducted underwater by means of Remotely Operated Vehicles (ROVs).
Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception
We conduct extensive empirical studies and reveal the following key insights: 1) Performing gradient descent updates by alternating on diverse modalities, loss functions, and tasks, with varying input resolutions, efficiently improves the model.
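The alternating-update idea described in insight 1) can be sketched on a toy problem: one shared parameter vector is trained by taking gradient steps on different tasks in turn, rather than summing all losses at once. The two regression tasks below are illustrative stand-ins for, e.g., an audio loss and a video loss; the learning rate and step count are arbitrary choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two toy regression "tasks" sharing one parameter vector w
# (stand-ins for losses on different modalities).
X_a, X_b = rng.normal(size=(64, 3)), rng.normal(size=(64, 3))
w_true = np.array([1.0, -2.0, 0.5])
y_a, y_b = X_a @ w_true, X_b @ w_true

def grad(X, y, w):
    # Gradient of the mean squared error 0.5 * ||Xw - y||^2 / n.
    return X.T @ (X @ w - y) / len(y)

w = np.zeros(3)
lr = 0.1
for step in range(200):
    # Alternate which task supplies the gradient at each step.
    X, y = (X_a, y_a) if step % 2 == 0 else (X_b, y_b)
    w -= lr * grad(X, y, w)

print(np.round(w, 2))  # converges toward w_true
```

Because each step touches only one task's data, the same pattern extends to tasks with different input resolutions or loss functions that could not share a single batched forward pass.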