TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Action Recognition In Videos	HMDB-51	STM (ImageNet+Kinetics pretrain)	Average accuracy of 3 splits	72.2	# 1
Action Recognition In Videos	Jester (Gesture Recognition)	STM (ResNet-50, 16 frames)	Val	96.7	# 1
Action Classification	Kinetics-400	STM (ResNet-50)	Acc@1	73.7	# 160
Action Recognition In Videos	Something-Something V1	STM (16 frames, ImageNet pretraining)	Top 1 Accuracy	50.7	# 1
Action Recognition In Videos	Something-Something V2	STM (16 frames, ImageNet pretraining)	Top-1 Accuracy	64.2	# 1
Action Recognition In Videos	Something-Something V2	STM (16 frames, ImageNet pretraining)	Top-5 Accuracy	89.8	# 1
Action Recognition In Videos	UCF101	STM (ImageNet+Kinetics pretrain)	3-fold Accuracy	96.2	# 1

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/stm-spatiotemporal-and-motion-encoding-for/action-recognition-in-videos-on-hmdb-51-1)](https://paperswithcode.com/sota/action-recognition-in-videos-on-hmdb-51-1?p=stm-spatiotemporal-and-motion-encoding-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/stm-spatiotemporal-and-motion-encoding-for/action-recognition-in-videos-on-jester-1)](https://paperswithcode.com/sota/action-recognition-in-videos-on-jester-1?p=stm-spatiotemporal-and-motion-encoding-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/stm-spatiotemporal-and-motion-encoding-for/action-recognition-in-videos-on-something-2)](https://paperswithcode.com/sota/action-recognition-in-videos-on-something-2?p=stm-spatiotemporal-and-motion-encoding-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/stm-spatiotemporal-and-motion-encoding-for/action-recognition-in-videos-on-something-3)](https://paperswithcode.com/sota/action-recognition-in-videos-on-something-3?p=stm-spatiotemporal-and-motion-encoding-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/stm-spatiotemporal-and-motion-encoding-for/action-recognition-in-videos-on-ucf101-2)](https://paperswithcode.com/sota/action-recognition-in-videos-on-ucf101-2?p=stm-spatiotemporal-and-motion-encoding-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/stm-spatiotemporal-and-motion-encoding-for/action-classification-on-kinetics-400)](https://paperswithcode.com/sota/action-classification-on-kinetics-400?p=stm-spatiotemporal-and-motion-encoding-for)`
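The badge snippets above all follow one URL pattern: a shields.io endpoint image that points at a Papers with Code badge URL, wrapped in a link to the matching leaderboard. A minimal sketch of that pattern follows. The function name `pwc_badge_markdown` and its parameter names are hypothetical, chosen for illustration; only the URL layout is taken from the snippets listed here.

```python
# Sketch: rebuild the badge Markdown shown above from its two variable parts.
# The helper name and parameter names are illustrative assumptions, not an
# official Papers with Code API.

BADGE_ENDPOINT = "https://img.shields.io/endpoint.svg"
PWC_BASE = "https://paperswithcode.com"


def pwc_badge_markdown(paper_slug: str, benchmark_slug: str) -> str:
    """Return a Markdown badge linking to a Papers with Code leaderboard."""
    # URL that shields.io fetches to render the badge image.
    badge_source = f"{PWC_BASE}/badge/{paper_slug}/{benchmark_slug}"
    image_url = f"{BADGE_ENDPOINT}?url={badge_source}"
    # Leaderboard page the badge links to; `p` identifies the paper.
    target_url = f"{PWC_BASE}/sota/{benchmark_slug}?p={paper_slug}"
    return f"[![PWC]({image_url})]({target_url})"


if __name__ == "__main__":
    print(
        pwc_badge_markdown(
            "stm-spatiotemporal-and-motion-encoding-for",
            "action-classification-on-kinetics-400",
        )
    )
```

Called with the paper slug and the Kinetics-400 benchmark slug, this should reproduce the last snippet in the list character for character.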

STM: SpatioTemporal and Motion Encoding for Action Recognition

ICCV 2019 · Boyuan Jiang, Mengmeng Wang, Weihao Gan, Wei Wu, Junjie Yan

Spatiotemporal and motion features are two complementary and crucial types of information for video action recognition. Recent state-of-the-art methods adopt a 3D CNN stream to learn spatiotemporal features and a separate flow stream to learn motion features. In this work, we aim to efficiently encode both kinds of features in a unified 2D framework. To this end, we first propose an STM block, which contains a Channel-wise SpatioTemporal Module (CSTM) to represent the spatiotemporal features and a Channel-wise Motion Module (CMM) to efficiently encode motion features. We then replace the original residual blocks in the ResNet architecture with STM blocks to form a simple yet effective STM network that introduces very limited extra computation cost. Extensive experiments demonstrate that, by encoding spatiotemporal and motion features together, the proposed STM network outperforms state-of-the-art methods on both temporal-related datasets (i.e., Something-Something v1 & v2 and Jester) and scene-related datasets (i.e., Kinetics-400, UCF-101, and HMDB-51).
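The abstract does not spell out how the CMM computes motion features. As a rough illustration of encoding motion without a separate optical-flow stream, the sketch below assumes that a motion cue can be taken as the difference between the features of adjacent frames, with a zero vector appended so the temporal length is preserved. Both the differencing scheme and the name `temporal_difference` are assumptions made for this example, and flat lists of floats stand in for real per-frame feature maps.

```python
# Sketch under stated assumptions: motion cues as adjacent-frame feature
# differences. This is an illustration only, not the CMM as defined in the paper.
from typing import List

Frame = List[float]  # hypothetical stand-in for one frame's feature map


def temporal_difference(frames: List[Frame]) -> List[Frame]:
    """Return features[t + 1] - features[t] for each t, zero-padded at the end."""
    if not frames:
        return []
    diffs = [
        [nxt_v - cur_v for cur_v, nxt_v in zip(cur, nxt)]
        for cur, nxt in zip(frames, frames[1:])
    ]
    # The last frame has no successor, so pad with zeros to keep T outputs.
    diffs.append([0.0] * len(frames[-1]))
    return diffs


if __name__ == "__main__":
    clip = [[1.0, 2.0], [1.5, 2.0], [3.0, 1.0]]
    print(temporal_difference(clip))
```

Because the output has one entry per input frame, such cues could in principle be combined frame by frame with spatiotemporal features inside a 2D network, which is the general idea the abstract describes.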

Code

No code implementations yet.

Tasks

Action Classification

Action Recognition

Action Recognition In Videos

Temporal Action Localization

Datasets

UCF101

Kinetics

HMDB51

Kinetics 400

Something-Something V2

Something-Something V1

Jester (Gesture Recognition)

Results from the Paper

Ranked #1 on Action Recognition In Videos on HMDB-51

Methods

1x1 Convolution • Average Pooling • Batch Normalization • Bottleneck Residual Block • Convolution • Global Average Pooling • Kaiming Initialization • Max Pooling • ReLU • Residual Block • Residual Connection • ResNet
