TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Zero-Shot Action Recognition	HMDB51	OTI(ViT-L/14)	Top-1 Accuracy	64	# 2
Zero-Shot Action Recognition	Kinetics	OTI(ViT-L/14)	Top-1 Accuracy	70.6	# 4
Zero-Shot Action Recognition	UCF101	OTI(ViT-L/14)	Top-1 Accuracy	92.8	# 1

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/orthogonal-temporal-interpolation-for-zero/zero-shot-action-recognition-on-ucf101)](https://paperswithcode.com/sota/zero-shot-action-recognition-on-ucf101?p=orthogonal-temporal-interpolation-for-zero)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/orthogonal-temporal-interpolation-for-zero/zero-shot-action-recognition-on-hmdb51)](https://paperswithcode.com/sota/zero-shot-action-recognition-on-hmdb51?p=orthogonal-temporal-interpolation-for-zero)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/orthogonal-temporal-interpolation-for-zero/zero-shot-action-recognition-on-kinetics)](https://paperswithcode.com/sota/zero-shot-action-recognition-on-kinetics?p=orthogonal-temporal-interpolation-for-zero)`

Orthogonal Temporal Interpolation for Zero-Shot Video Recognition

14 Aug 2023 · Yan Zhu, Junbao Zhuo, Bin Ma, Jiajia Geng, Xiaoming Wei, Xiaolin Wei, Shuhui Wang

Zero-shot video recognition (ZSVR) aims to recognize video categories that were not seen during model training. Recently, vision-language models (VLMs) pre-trained on large-scale image-text pairs have demonstrated impressive transferability for ZSVR. To make VLMs applicable to the video domain, existing methods often add a temporal learning module after the image-level encoder to learn the temporal relationships among video frames. Unfortunately, for videos from unseen categories, we observe an abnormal phenomenon: the model that uses spatial-temporal features performs much worse than the model that removes the temporal learning module and uses only spatial features. We conjecture that improper temporal modeling disrupts the spatial features of the video. To verify this hypothesis, we propose Feature Factorization to retain the orthogonal temporal feature of the video and use interpolation to construct a refined spatial-temporal feature. The model using the appropriately refined spatial-temporal feature performs better than the one using only the spatial feature, which verifies the effectiveness of the orthogonal temporal feature for the ZSVR task. Therefore, an Orthogonal Temporal Interpolation module is designed to learn a better refined spatial-temporal video feature during training. Additionally, a Matching Loss is introduced to improve the quality of the orthogonal temporal feature. We propose a model called OTI for ZSVR that employs orthogonal temporal interpolation and the matching loss on top of VLMs. The ZSVR accuracies on popular video datasets (i.e., Kinetics-600, UCF101 and HMDB51) show that OTI outperforms the previous state-of-the-art method by a clear margin.
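
The abstract does not give formulas for Feature Factorization or the interpolation step, so the following is only a minimal sketch of one plausible reading: the temporal feature is split by vector projection into a part parallel to the spatial feature and a part orthogonal to it, and the refined feature is the spatial feature plus a weighted orthogonal part. The function names and the weight `lam` are hypothetical and are not taken from the paper or its code.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthogonal_component(temporal_feat, spatial_feat):
    """Return the part of temporal_feat that is orthogonal to spatial_feat.

    Assumption: "orthogonal temporal feature" is read here as the residual
    left after projecting the temporal feature onto the spatial feature.
    """
    norm_sq = dot(spatial_feat, spatial_feat)
    if norm_sq == 0.0:
        raise ValueError("spatial_feat must be non-zero")
    scale = dot(temporal_feat, spatial_feat) / norm_sq
    return [t - scale * s for t, s in zip(temporal_feat, spatial_feat)]


def interpolate(spatial_feat, orth_feat, lam):
    """Refined feature = spatial feature + lam * orthogonal temporal part.

    `lam` is a hypothetical interpolation weight; lam = 0 falls back to the
    spatial-only feature.
    """
    return [s + lam * o for s, o in zip(spatial_feat, orth_feat)]


if __name__ == "__main__":
    spatial = [1.0, 0.0]    # toy spatial feature
    temporal = [1.0, 1.0]   # toy temporal feature
    orth = orthogonal_component(temporal, spatial)
    refined = interpolate(spatial, orth, lam=0.5)
    print(orth, refined)
```

Under this reading, the orthogonal part cannot change the component of the refined feature along the spatial direction, which matches the abstract's motivation that temporal modeling should not disrupt the spatial feature.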

Code

sweetorangezhuyan/mm2023_oti official

Tasks

Video Recognition

Zero-Shot Action Recognition

Zero-Shot Action Recognition on HMDB51

Zero-Shot Action Recognition on UCF101

Datasets

UCF101

Kinetics

HMDB51

Kinetics 400

Kinetics-600

Results from the Paper

Ranked #1 on Zero-Shot Action Recognition on UCF101
