Video Classification
172 papers with code • 11 benchmarks • 17 datasets
Video Classification is the task of producing a label that is relevant to a video given its frames. A good video-level classifier not only provides accurate frame labels but also best describes the entire video, given the features and annotations of its frames. For example, a video might contain a tree in some frames, but the label that is central to the video might be something else (e.g., “hiking”). The granularity of the labels needed to describe the frames and the video depends on the task. Typical tasks include assigning one or more global labels to the video and assigning one or more labels to each frame in the video.
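The difference between per-frame labels and a global video label can be illustrated with a small sketch. It assumes a frame-level classifier has already produced one score per label for each frame. The label set, the scores, and the function names `frame_labels` and `video_label` are hypothetical, and mean pooling over frames is only one simple aggregation choice, not a method prescribed by this page.

```python
from typing import List

# Hypothetical label set, used only for illustration.
LABELS = ["tree", "hiking", "cooking"]


def argmax(values: List[float]) -> int:
    """Return the index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def frame_labels(frame_scores: List[List[float]], labels: List[str]) -> List[str]:
    """Per-frame task: pick the highest-scoring label for each frame."""
    return [labels[argmax(scores)] for scores in frame_scores]


def video_label(frame_scores: List[List[float]], labels: List[str]) -> str:
    """Global task: average the scores over all frames, then take the argmax.

    Mean pooling is an assumption made for this sketch; other aggregations
    (max pooling, attention, temporal models) are equally possible.
    """
    if not frame_scores:
        raise ValueError("a video needs at least one frame")
    n_frames = len(frame_scores)
    mean_scores = [
        sum(scores[i] for scores in frame_scores) / n_frames
        for i in range(len(labels))
    ]
    return labels[argmax(mean_scores)]


# Made-up scores for a four-frame video: a tree dominates the first frame,
# but "hiking" dominates the video as a whole.
scores = [
    [0.7, 0.2, 0.1],
    [0.3, 0.6, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1],
]

per_frame = frame_labels(scores, LABELS)
overall = video_label(scores, LABELS)
print(per_frame)  # ['tree', 'hiking', 'hiking', 'hiking']
print(overall)    # hiking
```

Here the first frame is labeled "tree", yet the pooled scores favor "hiking", which mirrors the example in the task description.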
Latest papers
Text-to-feature diffusion for audio-visual few-shot learning
Training deep learning models for video classification from audio-visual data commonly requires immense amounts of labeled training data collected via a costly process.
Identifying Misinformation on YouTube through Transcript Contextual Analysis with Transformer Models
We apply the trained models to three datasets: (a) YouTube Vaccine-misinformation related videos, (b) YouTube Pseudoscience videos, and (c) Fake-News dataset (a collection of articles).
MUVF-YOLOX: A Multi-modal Ultrasound Video Fusion Network for Renal Tumor Diagnosis
In addition, we design an object-level temporal aggregation (OTA) module that can automatically filter low-quality features and efficiently integrate temporal information from multiple frames to improve the accuracy of tumor diagnosis.
Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution
The ubiquitous and demonstrably suboptimal choice of resizing images to a fixed resolution before processing them with computer vision models has not yet been successfully challenged.
Learning Unseen Modality Interaction
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
Inflated 3D Convolution-Transformer for Weakly-supervised Carotid Stenosis Grading with Ultrasound Videos
First, to avoid the requirement of laborious and unreliable annotation, we propose a novel and effective video classification network for weakly-supervised CSG.
Malicious or Benign? Towards Effective Content Moderation for Children's Videos
Online video platforms receive hundreds of hours of uploads every minute, making manual content moderation impossible.
HateMM: A Multi-Modal Dataset for Hate Video Classification
Hate speech has become one of the most significant issues in modern society, having implications in both the online and the offline world.
Verbs in Action: Improving verb understanding in video-language models
Understanding verbs is crucial to modelling how people and objects interact with each other and the environment through space and time.
SparseFormer: Sparse Visual Recognition via Limited Latent Tokens
In this paper, we challenge this dense paradigm and present a new method, coined SparseFormer, to imitate human's sparse visual recognition in an end-to-end manner.