Self-supervised Spatio-temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics

CVPR 2019. Jiangliu Wang, Jianbo Jiao, Linchao Bao, Shengfeng He, Yunhui Liu, Wei Liu

We address the problem of video representation learning without human-annotated labels. Previous efforts tackle this problem by designing novel self-supervised tasks on video data, but the features they learn are merely frame-by-frame and therefore do not suit the many video analysis tasks in which spatio-temporal features prevail...

| Task | Dataset | Model | Metric name | Metric value | Global rank |
|---|---|---|---|---|---|
| Self-Supervised Action Recognition | HMDB51 | Motion & Appearance (C3D) | Top-1 Accuracy | 20.3 | # 8 |
| | | | Pre-Training Dataset | UCF101 | # 1 |
| Action Recognition | HMDB-51 | Pretrained on Kinetics | Average accuracy of 3 splits | 33.4 | # 27 |
| Self-Supervised Action Recognition | UCF101 | Motion & Appearance (C3D) | 3-fold Accuracy | 58.8 | # 12 |
| | | | Pre-Training Dataset | UCF101 | # 1 |
| Action Recognition | UCF101 | Pretrained on Kinetics | 3-fold Accuracy | 61.2 | # 32 |
