Video Classification

178 papers with code • 11 benchmarks • 17 datasets

Video Classification is the task of producing a label that is relevant to the video given its frames. A good video-level classifier is one that not only provides accurate frame labels, but also best describes the entire video given the features and the annotations of the various frames in the video. For example, a video might contain a tree in some frames, but the label that is central to the video might be something else (e.g., "hiking"). The granularity of the labels needed to describe the frames and the video depends on the task. Typical tasks include assigning one or more global labels to the video, and assigning one or more labels to each frame inside the video.
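The distinction above (frame labels vs. a single video-level label) can be sketched with a simple temporal-pooling baseline: score each frame independently, then average the per-frame probabilities over time so that a label dominating most frames wins over one that appears briefly. This is a minimal illustration, not any specific paper's method; the `classify_video` function and the example logits are hypothetical.

```python
import numpy as np

def classify_video(frame_logits: np.ndarray, labels: list[str]) -> str:
    """Aggregate per-frame class scores into one video-level label.

    frame_logits: (num_frames, num_classes) array of raw scores,
    e.g. from any per-frame image classifier (hypothetical here).
    """
    # Softmax each frame's logits into probabilities.
    exp = np.exp(frame_logits - frame_logits.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)
    # Mean pooling over time: a label seen in one frame (e.g. "tree")
    # is outweighed by the label dominating most frames.
    video_probs = probs.mean(axis=0)
    return labels[int(video_probs.argmax())]

labels = ["hiking", "tree", "cooking"]
# Four frames: one strongly shows "tree", three show "hiking".
logits = np.array([
    [2.0, 0.5, 0.1],
    [0.2, 3.0, 0.1],
    [2.5, 0.3, 0.2],
    [2.2, 0.4, 0.1],
])
print(classify_video(logits, labels))  # → hiking
```

Mean pooling is the simplest aggregation; the papers below replace it with learned temporal models (recurrent layers, attention, MLP mixing) that also capture motion and ordering.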

Source: Efficient Large Scale Video Classification

Most implemented papers

TNT: Text-Conditioned Network with Transductive Inference for Few-Shot Video Classification

ojedaf/TNT 21 Jun 2021

Recently, few-shot learning has received increasing interest.

MorphMLP: An Efficient MLP-Like Backbone for Spatial-Temporal Representation Learning

MTLab/MorphMLP 24 Nov 2021

With such multi-dimension and multi-scale factorization, our MorphMLP block can achieve a great accuracy-computation balance.

A Dataset for Medical Instructional Video Classification and Question Answering

deepaknlp/medvidqacl 30 Jan 2022

This paper introduces a new challenge and datasets to foster research toward designing systems that can understand medical videos and provide visual answers to natural language questions.

Temporal and cross-modal attention for audio-visual zero-shot learning

explainableml/tcaf-gzsl 20 Jul 2022

We show that our proposed framework that ingests temporal features yields state-of-the-art performance on the UCF, VGGSound, and ActivityNet benchmarks for (generalised) zero-shot learning.

TAD: A Large-Scale Benchmark for Traffic Accidents Detection from Video Surveillance

forAcadamic/T-CAD 26 Sep 2022

After integration and annotation along various dimensions, a large-scale traffic accident dataset named TAD is proposed in this work.

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

opengvlab/internvl 21 Dec 2023

However, the progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs.

Large-Scale Video Classification with Convolutional Neural Networks

lRomul/ball-action-spotting CVPR 2014

We further study the generalization performance of our best model by retraining the top layers on the UCF-101 Action Recognition dataset and observe significant performance improvements compared to the UCF-101 baseline model (63.3%, up from 43.9%).

Beyond Short Snippets: Deep Networks for Video Classification

shobrook/sequitur CVPR 2015

Convolutional neural networks (CNNs) have been extensively applied for image recognition problems giving state-of-the-art results on recognition, detection, segmentation and retrieval.

Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification

tejgvsl/Camera-motion-classification-in-a-video-file 7 Apr 2015

In this paper, we propose a hybrid deep learning framework for video classification, which is able to model static spatial information, short-term motion, as well as long-term temporal clues in the videos.

ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding

Munsebum/AI-Project_A1 CVPR 2015

In spite of many dataset efforts for human action recognition, current computer vision algorithms are still severely limited in terms of the variability and complexity of the actions that they can recognize.