Video Classification

172 papers with code • 11 benchmarks • 17 datasets

Video Classification is the task of producing a label that is relevant to the video given its frames. A good video-level classifier is one that not only provides accurate frame labels, but also best describes the entire video given the features and annotations of its frames. For example, a video might contain a tree in some frame, but the label that is central to the video might be something else (e.g., “hiking”). The granularity of the labels needed to describe the frames and the video depends on the task. Typical tasks include assigning one or more global labels to the video, and assigning one or more labels to each frame inside the video.

Source: Efficient Large Scale Video Classification
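
The difference between frame-level and video-level labels described above can be illustrated with a small sketch. It is not taken from the cited paper: the function names, the example scores, the mean-pooling step, and the 0.5 threshold are all assumptions chosen for illustration, and it presumes some upstream model has already produced independent per-label scores for every frame.

```python
from typing import Dict, List

# Hypothetical per-frame scores: one dict per frame, mapping label -> score in [0, 1].
# The scores are treated as independent per-label confidences (multi-label setting).
FrameScores = Dict[str, float]


def frame_labels(frames: List[FrameScores]) -> List[str]:
    """Frame-level task: pick the highest-scoring label for each frame."""
    return [max(scores, key=scores.get) for scores in frames]


def video_labels(frames: List[FrameScores], threshold: float = 0.5) -> List[str]:
    """Video-level task: average each label's score over all frames and
    keep the labels whose mean score reaches the (assumed) threshold."""
    if not frames:
        return []
    totals: Dict[str, float] = {}
    for scores in frames:
        for label, score in scores.items():
            totals[label] = totals.get(label, 0.0) + score
    # A label missing from a frame contributes 0 to that frame's share of the mean.
    means = {label: total / len(frames) for label, total in totals.items()}
    return sorted(label for label, mean in means.items() if mean >= threshold)


# Made-up scores for a three-frame video.
frames = [
    {"hiking": 0.7, "tree": 0.2},
    {"hiking": 0.4, "tree": 0.9},
    {"hiking": 0.8, "tree": 0.1},
]

print(frame_labels(frames))  # ['hiking', 'tree', 'hiking']
print(video_labels(frames))  # ['hiking']
```

In this toy input the second frame is best described as “tree”, yet the only label that survives pooling over the whole video is “hiking”, which mirrors the example in the description. Mean pooling is only one possible aggregation; max pooling or a learned temporal model would be equally plausible choices.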

Latest papers with no code

Efficient Selective Audio Masked Multimodal Bottleneck Transformer for Audio-Video Classification

no code yet • 8 Jan 2024

To learn from multimodal videos effectively, we propose a novel audio-video recognition approach termed audio-video Transformer (AVT), which leverages the effective spatio-temporal representation of the video Transformer to improve action recognition accuracy.

Neural architecture impact on identifying temporally extended Reinforcement Learning tasks

no code yet • 4 Oct 2023

In addition, motivated by recent developments in attention-based video classification models using Vision Transformer, we also propose a Vision Transformer-based architecture for the image-based RL domain.

Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding

no code yet • 20 Sep 2023

While most modern video understanding models operate on short-range clips, real-world videos are often several minutes long with semantically consistent segments of variable length.

Language as the Medium: Multimodal Video Classification through text only

no code yet • 19 Sep 2023

Despite an exciting new wave of multimodal machine learning models, current approaches still struggle to interpret the complex contextual relationships between the different modalities present in videos.

AV-MaskEnhancer: Enhancing Video Representations through Audio-Visual Masked Autoencoder

no code yet • 15 Sep 2023

Learning high-quality video representation has shown significant applications in computer vision and remains challenging.

The Staged Knowledge Distillation in Video Classification: Harmonizing Student Progress by a Complementary Weakly Supervised Framework

no code yet • 11 Jul 2023

Our proposed substage-based distillation approach has the potential to inform future research on label-efficient learning for video data.

Active Learning for Video Classification with Frame Level Queries

no code yet • International Joint Conference on Neural Networks (IJCNN) 2023

To the best of our knowledge, this is the first research effort to develop an active learning framework for video classification, where the annotators need to inspect only a few frames to produce a label, rather than watching the end-to-end video.

Boosting Breast Ultrasound Video Classification by the Guidance of Keyframe Feature Centers

no code yet • 12 Jun 2023

The coherence loss uses the feature centers generated by the static images to guide the frame attention in the video model.

Multi-label Video Classification for Underwater Ship Inspection

no code yet • 27 May 2023

Today, ship hull inspection, including the examination of the external coating, detection of defects, and other types of external degradation such as corrosion and marine growth, is conducted underwater by means of Remotely Operated Vehicles (ROVs).

Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception

no code yet • NeurIPS 2023

We conduct extensive empirical studies and reveal the following key insights: 1) Performing gradient descent updates by alternating on diverse modalities, loss functions, and tasks, with varying input resolutions, efficiently improves the model.