Self-Supervised Spatiotemporal Learning via Video Clip Order Prediction

We propose a self-supervised spatiotemporal learning technique that leverages the chronological order of videos. Our method learns spatiotemporal video representations by predicting the order of shuffled clips sampled from a video. No category labels are required, so the technique can exploit virtually unlimited unannotated video. Related works operate on individual frames; compared to frames, clips are more consistent with the underlying video dynamics, reduce the ambiguity of possible orders, and are therefore better suited to learning a video representation. A 3D convolutional neural network extracts a feature for each clip, and these features are fused to predict the actual order. We evaluate the learned representations with nearest-neighbor retrieval experiments, and also use the learned networks as pre-trained models and fine-tune them on the action recognition task. Three types of 3D convolutional neural networks are tested, and all achieve large improvements over existing self-supervised methods.
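The pretext task described above lends itself to a compact sketch. Below is a minimal PyTorch illustration of clip order prediction, assuming a generic 3D CNN backbone that maps a clip tensor to a flat feature vector; the class and function names (`ClipOrderPrediction`, `shuffle_clips`) and the concatenation-based fusion head are illustrative assumptions, not necessarily the paper's exact architecture.

```python
import itertools
import random
import torch
import torch.nn as nn

class ClipOrderPrediction(nn.Module):
    """Sketch of the clip order prediction pretext task: a shared 3D CNN
    encodes each shuffled clip, and a small classifier predicts which
    permutation was applied (assumed fusion scheme, not the paper's exact one)."""

    def __init__(self, backbone, feat_dim, num_clips=3):
        super().__init__()
        self.backbone = backbone  # shared 3D CNN (e.g., an R3D variant)
        self.num_clips = num_clips
        # One class per possible permutation of the clips.
        self.num_orders = len(list(itertools.permutations(range(num_clips))))
        self.classifier = nn.Sequential(
            nn.Linear(num_clips * feat_dim, 512),
            nn.ReLU(inplace=True),
            nn.Linear(512, self.num_orders),
        )

    def forward(self, clips):
        # clips: list of num_clips tensors, each (B, C, T, H, W), already
        # shuffled; the backbone is assumed to return (B, feat_dim).
        feats = [self.backbone(c) for c in clips]
        fused = torch.cat(feats, dim=1)   # (B, num_clips * feat_dim)
        return self.classifier(fused)     # logits over permutations


def shuffle_clips(clips):
    """Shuffle a list of clips; return the shuffled clips and the
    permutation index, which serves as the self-supervised label."""
    perms = list(itertools.permutations(range(len(clips))))
    label = random.randrange(len(perms))
    return [clips[i] for i in perms[label]], label
```

With three clips there are 3! = 6 possible orders, so the pretext task reduces to a 6-way classification problem; after pre-training, the backbone is kept and fine-tuned on the downstream action recognition task.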


Results from Other Papers


| Task | Dataset | Model | Metric Name | Metric Value | Rank |
| --- | --- | --- | --- | --- | --- |
| Self-Supervised Action Recognition | HMDB51 | Video Clip Ordering (R3D) | Top-1 Accuracy | 29.5 | # 45 |
| | | | Pre-Training Dataset | UCF101 | # 1 |
| | | | Frozen | false | # 1 |
| Self-Supervised Action Recognition | UCF101 | Video Clip Ordering (R3D) | 3-fold Accuracy | 64.9 | # 43 |
| | | | Pre-Training Dataset | UCF101 | # 1 |
| | | | Frozen | false | # 1 |
