TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Action Classification	Kinetics-400	Side4Video (EVA, ViT-E/14)	Acc@1	88.6	# 18
Action Classification	Kinetics-400	Side4Video (EVA, ViT-E/14)	Acc@5	98.2	# 10
Video Retrieval	MSR-VTT-1kA	Side4Video	text-to-video Mean Rank	12.8	# 13
Video Retrieval	MSR-VTT-1kA	Side4Video	text-to-video R@1	52.3	# 13
Video Retrieval	MSR-VTT-1kA	Side4Video	text-to-video R@5	75.5	# 17
Video Retrieval	MSR-VTT-1kA	Side4Video	text-to-video R@10	84.2	# 16
Video Retrieval	MSR-VTT-1kA	Side4Video	text-to-video Median Rank	1.0	# 1
Video Retrieval	MSVD	Side4Video	text-to-video R@1	56.1	# 8
Video Retrieval	MSVD	Side4Video	text-to-video R@5	81.7	# 7
Video Retrieval	MSVD	Side4Video	text-to-video R@10	88.8	# 6
Video Retrieval	MSVD	Side4Video	text-to-video Median Rank	1.0	# 1
Video Retrieval	MSVD	Side4Video	text-to-video Mean Rank	8.4	# 4
Action Recognition	Something-Something V1	Side4Video (EVA ViT-E/14)	Top 1 Accuracy	67.3	# 3
Action Recognition	Something-Something V1	Side4Video (EVA ViT-E/14)	Top 5 Accuracy	88.8	# 2
Action Recognition	Something-Something V2	Side4Video (EVA ViT-E/14)	Top-1 Accuracy	75.2	# 10
Action Recognition	Something-Something V2	Side4Video (EVA ViT-E/14)	Top-5 Accuracy	94.0	# 13
Video Retrieval	VATEX	Side4Video	text-to-video R@1	68.8	# 6
Video Retrieval	VATEX	Side4Video	text-to-video R@50	1.0	# 3
Video Retrieval	VATEX	Side4Video	text-to-video R@10	97.0	# 4
Video Retrieval	VATEX	Side4Video	text-to-video R@5	93.5	# 2
Video Retrieval	VATEX	Side4Video	text-to-video MedianR	2.7	# 2
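
The retrieval rows above report R@K, median rank, and mean rank. As a reminder of how such metrics are conventionally computed, here is a minimal sketch. It assumes a text-to-video similarity matrix in which the correct video for text `i` is video `i`; the function names `retrieval_ranks` and `recall_at_k` and the toy numbers are hypothetical and are not taken from the paper or its code.

```python
from statistics import mean, median


def retrieval_ranks(sim):
    """Return the 1-based rank of the ground-truth video for each text query.

    sim[i][j] is the similarity between text i and video j. The correct
    video for text i is assumed to sit on the diagonal (video i).
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank = 1 + number of videos scored strictly higher than the match.
        ranks.append(1 + sum(1 for s in row if s > row[i]))
    return ranks


def recall_at_k(ranks, k):
    """Percentage of queries whose correct video is ranked within the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


# Toy 4x4 similarity matrix (made-up values, for illustration only).
sim = [
    [0.9, 0.1, 0.2, 0.3],
    [0.8, 0.7, 0.1, 0.2],
    [0.1, 0.2, 0.6, 0.3],
    [0.9, 0.8, 0.7, 0.1],
]

ranks = retrieval_ranks(sim)
summary = {
    "R@1": recall_at_k(ranks, 1),
    "R@5": recall_at_k(ranks, 5),
    "Median Rank": median(ranks),
    "Mean Rank": mean(ranks),
}
print(ranks, summary)
```

Higher is better for R@K, while lower is better for median and mean rank, which is why a median rank of 1.0 can hold the top leaderboard position.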

Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning

27 Nov 2023 · Huanjin Yao, Wenhao Wu, Zhiheng Li

Large pre-trained vision models have achieved impressive success in computer vision. However, fully fine-tuning large models for downstream tasks, particularly in video understanding, can be prohibitively expensive computationally. Recent studies have therefore turned to efficient image-to-video transfer learning. Nevertheless, existing efficient fine-tuning methods pay little attention to training memory usage and do not explore transferring larger models to the video domain. In this paper, we present Side4Video, a novel Spatial-Temporal Side Network for memory-efficient fine-tuning of large image models for video understanding. Specifically, we introduce a lightweight spatial-temporal side network attached to the frozen vision model, which avoids backpropagation through the heavy pre-trained model and utilizes multi-level spatial features from the original image model. This extremely memory-efficient architecture enables our method to use 75% less memory than previous adapter-based methods. As a result, we can transfer a huge ViT-E (4.4B) to video understanding tasks, a model 14x larger than ViT-L (304M). Our approach achieves remarkable performance on various video datasets across unimodal and cross-modal tasks (i.e., action recognition and text-video retrieval), notably on Something-Something V1&V2 (67.3% & 74.6%), Kinetics-400 (88.6%), MSR-VTT (52.3%), MSVD (56.1%) and VATEX (68.8%). We release our code at https://github.com/HJYao00/Side4Video.
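
To make the side-network idea from the abstract concrete, the toy sketch below trains a small set of side weights on features taken from every level of a frozen backbone, so no gradient is ever computed for the backbone itself. It is not the authors' architecture: the scalar layers, the names `backbone_features`, `side_forward`, and `train_step`, and all numbers are assumptions made up for illustration. The real model uses spatial-temporal blocks on ViT features.

```python
import math

# Frozen "backbone": two fixed scalar layers standing in for a pre-trained
# image model. These weights are never updated (hypothetical values).
BACKBONE_WEIGHTS = (0.5, -1.5)


def backbone_features(x):
    """Run the frozen backbone and return the feature from every layer."""
    feats = []
    h = x
    for w in BACKBONE_WEIGHTS:
        h = math.tanh(w * h)
        feats.append(h)
    return feats


def side_forward(feats, side_w):
    """Toy side network: fuse each backbone level through its own
    trainable weight and accumulate the result."""
    s = 0.0
    for f, a in zip(feats, side_w):
        s = s + a * f
    return s


def train_step(x, y, side_w, lr=0.1):
    """One gradient step on the side weights only (squared-error loss)."""
    feats = backbone_features(x)  # treated as constants: no gradient needed
    err = side_forward(feats, side_w) - y
    # d(err^2)/d(a_k) = 2 * err * feats[k]; nothing is differentiated
    # with respect to BACKBONE_WEIGHTS.
    new_w = [a - lr * 2.0 * err * f for a, f in zip(side_w, feats)]
    return new_w, err * err


side_w = [0.0, 0.0]
losses = []
for _ in range(50):
    side_w, loss = train_step(1.0, 0.3, side_w)
    losses.append(loss)

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.2e}")
```

Because the backbone outputs are plain constants in `train_step`, its intermediate activations do not need to be stored for a backward pass, which is the intuition behind the memory savings described above.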

Code

HJYao00/Side4Video official

whwu95/ATM

Tasks

Action Classification

Action Recognition

Transfer Learning

Video Retrieval

Video Understanding

Datasets

Kinetics

Kinetics 400

MSR-VTT

MSVD

Something-Something V2

Something-Something V1

VATEX

Results from the Paper

Ranked #3 on Action Recognition on Something-Something V1

Methods

Focus
