TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Action Classification	Kinetics-400	ST-Adapter (ViT-L, CLIP)	Acc@1	87.2	# 31
Action Classification	Kinetics-400	ST-Adapter (ViT-L, CLIP)	Acc@5	97.6	# 18
Action Recognition	Something-Something V2	ST-Adapter (ViT-L, CLIP)	Top-1 Accuracy	72.3	# 23
Action Recognition	Something-Something V2	ST-Adapter (ViT-L, CLIP)	Top-5 Accuracy	93.9	# 16
Action Recognition	Something-Something V2	ST-Adapter (ViT-L, CLIP)	GFLOPs	8248	# 1

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/parameter-efficient-image-to-video-transfer/action-recognition-in-videos-on-something)](https://paperswithcode.com/sota/action-recognition-in-videos-on-something?p=parameter-efficient-image-to-video-transfer)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/parameter-efficient-image-to-video-transfer/action-classification-on-kinetics-400)](https://paperswithcode.com/sota/action-classification-on-kinetics-400?p=parameter-efficient-image-to-video-transfer)`

ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning

27 Jun 2022 · Junting Pan, Ziyi Lin, Xiatian Zhu, Jing Shao, Hongsheng Li

Capitalizing on large pre-trained models for various downstream tasks of interest has recently emerged with promising performance. Due to the ever-growing model size, the standard full fine-tuning based task adaptation strategy becomes prohibitively costly in terms of model training and storage. This has led to a new research direction in parameter-efficient transfer learning. However, existing attempts typically focus on downstream tasks from the same modality (e.g., image understanding) as the pre-trained model. This creates a limitation because in some modalities (e.g., video understanding) such a strong pre-trained model with sufficient knowledge is scarce or unavailable. In this work, we investigate this novel cross-modality transfer learning setting, namely parameter-efficient image-to-video transfer learning. To solve this problem, we propose a new Spatio-Temporal Adapter (ST-Adapter) for parameter-efficient fine-tuning per video task. With a built-in spatio-temporal reasoning capability in a compact design, ST-Adapter enables a pre-trained image model without temporal knowledge to reason about dynamic video content at a small (~8%) per-task parameter cost, requiring approximately 20 times fewer updated parameters compared to previous work. Extensive experiments on video action recognition tasks show that our ST-Adapter can match or even outperform the strong full fine-tuning strategy and state-of-the-art video models, whilst enjoying the advantage of parameter efficiency. The code and model are available at https://github.com/linziyi96/st-adapter
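To make the "~8% per-task parameter cost" figure concrete, here is a rough parameter-count sketch. The abstract does not describe the adapter's internals, so this assumes a common bottleneck layout (linear down-projection, depthwise 3D convolution for temporal mixing, linear up-projection) with one adapter per transformer block. The backbone size (about 86M parameters, width 768, 12 blocks, as in a ViT-B/16), the bottleneck width, and the kernel shape are all illustrative assumptions, as are the function names.

```python
# Rough parameter accounting for a hypothetical bottleneck adapter with a
# depthwise 3D convolution. Every size below is an illustrative assumption,
# not a value stated in the abstract.


def adapter_params(width, bottleneck, kernel):
    """Count weights and biases of one assumed adapter module."""
    kt, kh, kw = kernel
    down = width * bottleneck + bottleneck         # linear down-projection
    conv = bottleneck * kt * kh * kw + bottleneck  # depthwise conv: one filter per channel
    up = bottleneck * width + width                # linear up-projection
    return down + conv + up


def tuned_fraction(backbone_params, width, bottleneck, kernel, num_adapters):
    """Ratio of adapter parameters to frozen backbone parameters."""
    total = num_adapters * adapter_params(width, bottleneck, kernel)
    return total / backbone_params


# Assumed ViT-B/16-like backbone with one adapter per block.
BACKBONE_PARAMS = 86_000_000
WIDTH = 768
BOTTLENECK = 384       # assumed bottleneck width
KERNEL = (3, 1, 1)     # assumed kernel: 3 frames in time, 1x1 in space
NUM_ADAPTERS = 12

per_adapter = adapter_params(WIDTH, BOTTLENECK, KERNEL)
fraction = tuned_fraction(BACKBONE_PARAMS, WIDTH, BOTTLENECK, KERNEL, NUM_ADAPTERS)
print(f"per adapter: {per_adapter:,} parameters")
print(f"tuned fraction: {fraction:.1%}")
```

Under these assumptions the two projections dominate the count and the temporal convolution adds only about 1.5K parameters per adapter, which suggests how temporal modelling can be added cheaply. The classifier head and any other tuned parameters are left out of this estimate.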

Code

linziyi96/st-adapter official

Tasks

Action Classification

Action Recognition

Temporal Action Localization

Transfer Learning

Video Understanding

Datasets

ImageNet

Kinetics

Kinetics 400

Something-Something V2

Results from the Paper

Ranked #23 on Action Recognition on Something-Something V2 (using extra training data)

Methods

Adapter
