Self-Supervised Spatiotemporal Learning via Video Clip Order Prediction

We propose a self-supervised spatiotemporal learning technique that leverages the chronological order of videos. Our method learns spatiotemporal video representations by predicting the order of shuffled clips sampled from a video. No category labels are required, so the technique can exploit virtually unlimited unannotated video. Related works operate on individual frames; compared to frames, clips are more consistent with the underlying video dynamics, reduce the ambiguity of possible orders, and are therefore better suited to learning a video representation. A 3D convolutional neural network extracts a feature for each clip, and these features are fused to predict the actual order. We evaluate the learned representations with nearest-neighbor retrieval experiments, and also use the learned networks as pre-trained models and fine-tune them on the action recognition task. Three types of 3D convolutional neural networks are tested, and all achieve large improvements over existing self-supervised methods.
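The pretext task described above lends itself to a compact sketch. Below is a minimal PyTorch illustration of clip order prediction, assuming a generic 3D CNN backbone that maps a clip tensor to a flat feature vector; the class and function names (`ClipOrderPrediction`, `shuffle_clips`) and the concatenation-based fusion head are illustrative assumptions, not necessarily the paper's exact architecture.

```python
import itertools
import random
import torch
import torch.nn as nn

class ClipOrderPrediction(nn.Module):
    """Sketch of the clip order prediction pretext task: a shared 3D CNN
    encodes each shuffled clip, and a small classifier predicts which
    permutation was applied (assumed fusion scheme, not the paper's exact one)."""

    def __init__(self, backbone, feat_dim, num_clips=3):
        super().__init__()
        self.backbone = backbone  # shared 3D CNN (e.g., an R3D variant)
        self.num_clips = num_clips
        # One class per possible permutation of the clips.
        self.num_orders = len(list(itertools.permutations(range(num_clips))))
        self.classifier = nn.Sequential(
            nn.Linear(num_clips * feat_dim, 512),
            nn.ReLU(inplace=True),
            nn.Linear(512, self.num_orders),
        )

    def forward(self, clips):
        # clips: list of num_clips tensors, each (B, C, T, H, W), already
        # shuffled; the backbone is assumed to return (B, feat_dim).
        feats = [self.backbone(c) for c in clips]
        fused = torch.cat(feats, dim=1)   # (B, num_clips * feat_dim)
        return self.classifier(fused)     # logits over permutations


def shuffle_clips(clips):
    """Shuffle a list of clips; return the shuffled clips and the
    permutation index, which serves as the self-supervised label."""
    perms = list(itertools.permutations(range(len(clips))))
    label = random.randrange(len(perms))
    return [clips[i] for i in perms[label]], label
```

With three clips there are 3! = 6 possible orders, so the pretext task reduces to a 6-way classification problem; after pre-training, the backbone is kept and fine-tuned on the downstream action recognition task.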


Results from Other Papers


| Task | Dataset | Model | Metric Name | Metric Value | Rank |
| --- | --- | --- | --- | --- | --- |
| Self-Supervised Action Recognition | HMDB51 | Video Clip Ordering (R3D) | Top-1 Accuracy | 29.5 | # 45 |
| | | | Pre-Training Dataset | UCF101 | # 1 |
| | | | Frozen | false | # 1 |
| Self-Supervised Action Recognition | UCF101 | Video Clip Ordering (R3D) | 3-fold Accuracy | 64.9 | # 43 |
| | | | Pre-Training Dataset | UCF101 | # 1 |
| | | | Frozen | false | # 1 |
