TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Self-Supervised Action Recognition	HMDB51	3D RotNet (3D ResNet-18)	Top-1 Accuracy	33.7	# 42
Self-Supervised Action Recognition	HMDB51	3D RotNet (3D ResNet-18)	Pre-Training Dataset	Kinetics400	# 1
Self-Supervised Action Recognition	HMDB51	3D RotNet (3D ResNet-18)	Frozen	false	# 1
Self-Supervised Action Recognition	UCF101	3D RotNet (3D ResNet-18)	3-fold Accuracy	62.9	# 45
Self-Supervised Action Recognition	UCF101	3D RotNet (3D ResNet-18)	Pre-Training Dataset	Kinetics400	# 1
Self-Supervised Action Recognition	UCF101	3D RotNet (3D ResNet-18)	Frozen	false	# 1

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/self-supervised-spatiotemporal-feature/self-supervised-action-recognition-on-hmdb51)](https://paperswithcode.com/sota/self-supervised-action-recognition-on-hmdb51?p=self-supervised-spatiotemporal-feature)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/self-supervised-spatiotemporal-feature/self-supervised-action-recognition-on-ucf101)](https://paperswithcode.com/sota/self-supervised-action-recognition-on-ucf101?p=self-supervised-spatiotemporal-feature)`

Self-Supervised Spatiotemporal Feature Learning via Video Rotation Prediction

28 Nov 2018 · Longlong Jing, Xiaodong Yang, Jingen Liu, YingLi Tian

The success of deep neural networks generally requires a vast amount of labeled training data, which is expensive to obtain and infeasible at scale, especially for video collections. To alleviate this problem, we propose 3DRotNet: a fully self-supervised approach to learn spatiotemporal features from unlabeled videos. A set of rotations is applied to all videos, and a pretext task is defined as the prediction of these rotations. In accomplishing this task, 3DRotNet is effectively trained to understand the semantic concepts and motions in videos. In other words, it learns a spatiotemporal video representation that can be transferred to improve video understanding tasks on small datasets. Our extensive experiments demonstrate the effectiveness of the proposed framework on action recognition, leading to significant improvements over state-of-the-art self-supervised methods. With 3DRotNet pre-trained in a self-supervised manner on large datasets, recognition accuracy improves by 20.4% on UCF101 and 16.7% on HMDB51, compared to models trained from scratch.
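The rotation pretext task described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the abstract does not state which rotations are used, so the set of four rotations in 90-degree steps, the `(frames, height, width, channels)` clip layout, and the names `ROTATION_STEPS` and `make_rotation_batch` are all assumptions made for illustration.

```python
import numpy as np

# Assumption: rotations are multiples of 90 degrees (k quarter-turns).
# The abstract only says "a set of rotations", so this set is hypothetical.
ROTATION_STEPS = (0, 1, 2, 3)


def make_rotation_batch(clip):
    """Build rotated copies of one clip plus their pretext labels.

    clip: array shaped (frames, height, width, channels) -- assumed layout.
    Returns a list of rotated clips and an array of class indices, where
    label k means every frame was rotated by k * 90 degrees.
    """
    # Rotate in the spatial plane (height, width); the same rotation is
    # applied to every frame, so the temporal order is left unchanged.
    rotated = [np.rot90(clip, k=k, axes=(1, 2)) for k in ROTATION_STEPS]
    labels = np.array(ROTATION_STEPS, dtype=np.int64)
    return rotated, labels


# Toy clip: 4 frames of 8x8 RGB values.
clip = np.arange(4 * 8 * 8 * 3, dtype=np.float32).reshape(4, 8, 8, 3)
rotated_clips, labels = make_rotation_batch(clip)
```

A classifier trained to predict `labels` from `rotated_clips` needs no human annotation, since the labels come from the transformation itself. Its learned weights would then serve as the starting point for fine-tuning on a labeled action recognition dataset.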

Code

No code implementations yet.

Tasks

Action Recognition

Self-Supervised Action Recognition

Temporal Action Localization

Video Understanding

Datasets

UCF101

Kinetics

HMDB51

Results from the Paper

Ranked #42 on Self-Supervised Action Recognition on HMDB51

Methods

1x1 Convolution • Average Pooling • Batch Normalization • Bottleneck Residual Block • Convolution • Global Average Pooling • Kaiming Initialization • Max Pooling • ReLU • Residual Block • Residual Connection • ResNet
