Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning

27 Nov 2023 · Huanjin Yao, Wenhao Wu, Zhiheng Li

Large pre-trained vision models achieve impressive success in computer vision. However, fully fine-tuning large models for downstream tasks, particularly in video understanding, can be prohibitively computationally expensive. Recent studies have turned their focus towards efficient image-to-video transfer learning. Nevertheless, existing efficient fine-tuning methods pay little attention to training memory usage and to transferring larger models to the video domain. In this paper, we present Side4Video, a novel Spatial-Temporal Side Network for memory-efficient fine-tuning of large image models for video understanding. Specifically, we introduce a lightweight spatial-temporal side network attached to the frozen vision model, which avoids backpropagation through the heavy pre-trained model and utilizes multi-level spatial features from the original image model. This extremely memory-efficient architecture enables our method to reduce memory usage by 75% compared with previous adapter-based methods. In this way, we can transfer a huge ViT-E (4.4B) to video understanding tasks, which is 14x larger than ViT-L (304M). Our approach achieves remarkable performance on various video datasets across unimodal and cross-modal tasks (i.e., action recognition and text-video retrieval), especially on Something-Something V1&V2 (67.3% & 74.6%), Kinetics-400 (88.6%), MSR-VTT (52.3%), MSVD (56.1%) and VATEX (68.8%). We release our code at https://github.com/HJYao00/Side4Video.
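
The side-network idea described in the abstract can be illustrated with a minimal PyTorch sketch. This is not the released Side4Video implementation: the class names (`FrozenImageBackbone`, `SideBlock`, `SideTunedVideoModel`), the toy dimensions, and the choice of a depthwise temporal convolution followed by a small transformer layer are all assumptions made for illustration. The point it shows is that the frozen backbone runs without building an autograd graph, so gradients only flow through the lightweight side branch that consumes its multi-level features.

```python
import torch
import torch.nn as nn


class FrozenImageBackbone(nn.Module):
    """Toy stand-in for a pre-trained image ViT (hypothetical, not a real checkpoint)."""

    def __init__(self, dim=64, depth=4, heads=4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(dim, heads, dim * 2, dropout=0.0, batch_first=True)
            for _ in range(depth)
        ])
        for p in self.parameters():
            p.requires_grad_(False)  # backbone weights are never updated

    @torch.no_grad()  # no autograd graph is stored for the backbone
    def forward(self, x):
        feats = []
        for layer in self.layers:
            x = layer(x)
            feats.append(x)  # keep every level for the side network
        return feats


class SideBlock(nn.Module):
    """One lightweight side layer: fuse a backbone feature, mix over time, then over space."""

    def __init__(self, backbone_dim, side_dim, heads=2):
        super().__init__()
        self.down = nn.Linear(backbone_dim, side_dim)
        # Depthwise 1D convolution over the frame axis (assumed temporal module).
        self.temporal = nn.Conv1d(side_dim, side_dim, kernel_size=3, padding=1, groups=side_dim)
        self.spatial = nn.TransformerEncoderLayer(
            side_dim, heads, side_dim * 2, dropout=0.0, batch_first=True
        )

    def forward(self, h, feat, batch, frames):
        h = h + self.down(feat)  # inject the frozen feature of this level
        bt, n, c = h.shape
        # (B*T, N, C) -> (B*N, C, T) so the convolution runs along time
        t_in = h.view(batch, frames, n, c).permute(0, 2, 3, 1).reshape(batch * n, c, frames)
        t_out = self.temporal(t_in).view(batch, n, c, frames).permute(0, 3, 1, 2).reshape(bt, n, c)
        return self.spatial(h + t_out)


class SideTunedVideoModel(nn.Module):
    def __init__(self, dim=64, side_dim=16, depth=4, num_classes=10):
        super().__init__()
        self.backbone = FrozenImageBackbone(dim, depth)
        self.side_dim = side_dim
        self.blocks = nn.ModuleList([SideBlock(dim, side_dim) for _ in range(depth)])
        self.head = nn.Linear(side_dim, num_classes)

    def forward(self, tokens):
        b, t, n, d = tokens.shape  # (batch, frames, tokens per frame, dim)
        feats = self.backbone(tokens.reshape(b * t, n, d))  # frames processed independently
        h = tokens.new_zeros(b * t, n, self.side_dim)
        for block, feat in zip(self.blocks, feats):
            h = block(h, feat, b, t)
        return self.head(h.reshape(b, t * n, -1).mean(dim=1))


torch.manual_seed(0)
model = SideTunedVideoModel()
video_tokens = torch.randn(2, 8, 5, 64)  # already-embedded patch tokens (toy input)
logits = model(video_tokens)
logits.sum().backward()  # gradients reach only the side network and the head

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
print(f"trainable parameters: {trainable}, frozen parameters: {frozen}")
```

Because the backbone features carry no gradient history, the backward pass never touches the backbone, which is the property the abstract credits for the memory savings. In this toy setup the side branch is also much smaller than the backbone; the actual ratio in Side4Video depends on the authors' configuration and is not reproduced here.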

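| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Action Classification | Kinetics-400 | Side4Video (EVA, ViT-E/14) | Acc@1 | 88.6 | # 18 |
| Action Classification | Kinetics-400 | Side4Video (EVA, ViT-E/14) | Acc@5 | 98.2 | # 10 |
| Video Retrieval | MSR-VTT-1kA | Side4Video | text-to-video Mean Rank | 12.8 | # 13 |
| Video Retrieval | MSR-VTT-1kA | Side4Video | text-to-video R@1 | 52.3 | # 13 |
| Video Retrieval | MSR-VTT-1kA | Side4Video | text-to-video R@5 | 75.5 | # 17 |
| Video Retrieval | MSR-VTT-1kA | Side4Video | text-to-video R@10 | 84.2 | # 16 |
| Video Retrieval | MSR-VTT-1kA | Side4Video | text-to-video Median Rank | 1.0 | # 1 |
| Video Retrieval | MSVD | Side4Video | text-to-video R@1 | 56.1 | # 8 |
| Video Retrieval | MSVD | Side4Video | text-to-video R@5 | 81.7 | # 7 |
| Video Retrieval | MSVD | Side4Video | text-to-video R@10 | 88.8 | # 6 |
| Video Retrieval | MSVD | Side4Video | text-to-video Median Rank | 1.0 | # 1 |
| Video Retrieval | MSVD | Side4Video | text-to-video Mean Rank | 8.4 | # 4 |
| Action Recognition | Something-Something V1 | Side4Video (EVA ViT-E/14) | Top 1 Accuracy | 67.3 | # 3 |
| Action Recognition | Something-Something V1 | Side4Video (EVA ViT-E/14) | Top 5 Accuracy | 88.8 | # 2 |
| Action Recognition | Something-Something V2 | Side4Video (EVA ViT-E/14) | Top-1 Accuracy | 75.2 | # 10 |
| Action Recognition | Something-Something V2 | Side4Video (EVA ViT-E/14) | Top-5 Accuracy | 94.0 | # 13 |
| Video Retrieval | VATEX | Side4Video | text-to-video R@1 | 68.8 | # 6 |
| Video Retrieval | VATEX | Side4Video | text-to-video R@50 | 1.0 | # 3 |
| Video Retrieval | VATEX | Side4Video | text-to-video R@10 | 97.0 | # 4 |
| Video Retrieval | VATEX | Side4Video | text-to-video R@5 | 93.5 | # 2 |
| Video Retrieval | VATEX | Side4Video | text-to-video MedianR | 2.7 | # 2 |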

Methods