ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video

2 Oct 2023 · Xinhao Li, Limin Wang

Adapting image models to the video domain has become an efficient paradigm for solving video recognition tasks. Given the huge number of parameters and strong transferability of image models, full fine-tuning is inefficient and often unnecessary. Recent research has therefore shifted its focus towards parameter-efficient image-to-video adaptation. However, these adaptation strategies inevitably introduce extra computational cost to handle the domain gap and temporal modeling in videos. In this paper, our goal is to present a zero-cost adaptation paradigm (ZeroI2V) that transfers image transformers to video recognition tasks, i.e., introduces zero extra cost to the adapted models during inference. To achieve this goal, we present two core designs. First, to capture the dynamics in videos and reduce the difficulty of image-to-video adaptation, we exploit the flexibility of self-attention and introduce spatial-temporal dual-headed attention (STDHA), which efficiently endows image transformers with temporal modeling capability at zero extra parameters and computation. Second, to handle the domain gap between images and videos, we propose a linear adaptation strategy that uses lightweight, densely placed linear adapters to fully transfer the frozen image models to video recognition. Thanks to their customized linear design, all newly added adapters can be merged into the original modules through structural reparameterization after training, thus achieving zero extra cost during inference. Extensive experiments on four widely used video recognition benchmarks show that ZeroI2V can match or even outperform previous state-of-the-art methods while enjoying superior parameter and inference efficiency.
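The core idea behind STDHA is that some attention heads can be repurposed for temporal modeling without any new parameters or FLOPs: their keys and values are simply borrowed from neighbouring frames. The following is a minimal NumPy sketch of that idea; the head offsets, shapes, and names here are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

# Illustrative sketch (not the paper's exact design): per-head temporal
# offsets let a few heads attend to other frames by rolling their
# keys/values along the time axis. Heads with offset 0 stay spatial.
T, N, H, Dh = 4, 6, 4, 8  # frames, tokens per frame, heads, head dim
rng = np.random.default_rng(1)
q = rng.standard_normal((T, H, N, Dh))
k = rng.standard_normal((T, H, N, Dh))
v = rng.standard_normal((T, H, N, Dh))

offsets = [0, 0, 1, -1]  # hypothetical per-head temporal offsets

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

out = np.empty_like(q)
for h, dt in enumerate(offsets):
    # Temporal heads use keys/values from a neighbouring frame; this
    # adds zero parameters and zero extra computation versus plain
    # spatial attention, since only the indexing changes.
    k_h = np.roll(k[:, h], dt, axis=0)
    v_h = np.roll(v[:, h], dt, axis=0)
    attn = softmax(q[:, h] @ k_h.transpose(0, 2, 1) / np.sqrt(Dh))
    out[:, h] = attn @ v_h

print(out.shape)  # (4, 4, 6, 8)
```

Note that the attention computation itself is unchanged for every head; only the source frame of the keys/values differs, which is why the temporal capability comes at zero cost.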
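The zero inference cost of the linear adapters follows from a standard structural-reparameterization argument: a purely linear adapter applied after a frozen linear layer composes into a single linear layer. Below is a minimal NumPy sketch of that merge, assuming a residual linear adapter `out = z + A z` placed after a frozen layer `z = W x + b`; the specific placement and shapes are illustrative assumptions.

```python
import numpy as np

# Sketch of merging a residual linear adapter into a frozen linear layer.
# Frozen layer: z = W x + b.  Adapter (trained): out = z + A z.
# Then out = (I + A) W x + (I + A) b, so the adapter folds into
# W' = (I + A) W and b' = (I + A) b with zero extra inference cost.
rng = np.random.default_rng(0)
d_in, d_out = 8, 8
W = rng.standard_normal((d_out, d_in))   # frozen weight
b = rng.standard_normal(d_out)           # frozen bias
A = 0.01 * rng.standard_normal((d_out, d_out))  # trained adapter weight

def forward_with_adapter(x):
    z = W @ x + b
    return z + A @ z

# Structural reparameterization after training:
W_merged = (np.eye(d_out) + A) @ W
b_merged = (np.eye(d_out) + A) @ b

def forward_merged(x):
    return W_merged @ x + b_merged

x = rng.standard_normal(d_in)
assert np.allclose(forward_with_adapter(x), forward_merged(x))
```

Any nonlinearity inside the adapter would break this merge, which is why the adapters must stay strictly linear for the zero-cost claim to hold.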


Results from the Paper


Ranked #5 on Action Recognition on UCF101 (using extra training data)

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Action Recognition | HMDB-51 | ZeroI2V ViT-L/14 | Average accuracy of 3 splits | 83.4 | #11 |
| Action Classification | Kinetics-400 | ZeroI2V ViT-L/14 | Acc@1 | 87.2 | #31 |
| Action Classification | Kinetics-400 | ZeroI2V ViT-L/14 | Acc@5 | 97.6 | #18 |
| Action Recognition | Something-Something V2 | ZeroI2V ViT-L/14 | Top-1 Accuracy | 72.2 | #24 |
| Action Recognition | Something-Something V2 | ZeroI2V ViT-L/14 | Top-5 Accuracy | 93.0 | #19 |
| Action Recognition | UCF101 | ZeroI2V ViT-L/14 | 3-fold Accuracy | 98.6 | #5 |
