ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video

2 Oct 2023 · Xinhao Li, Limin Wang

Adapting image models to the video domain has become an efficient paradigm for solving video recognition tasks. Given the huge number of parameters and strong transferability of image models, full fine-tuning is inefficient and often unnecessary. Recent research has therefore shifted its focus towards parameter-efficient image-to-video adaptation. However, these adaptation strategies inevitably introduce extra computational cost to handle the domain gap and temporal modeling in videos. In this paper, our goal is to present a zero-cost adaptation paradigm (ZeroI2V) that transfers image transformers to video recognition tasks, i.e., introduces zero extra cost to the adapted models during inference. To achieve this goal, we present two core designs. First, to capture the dynamics in videos and reduce the difficulty of image-to-video adaptation, we exploit the flexibility of self-attention and introduce spatial-temporal dual-headed attention (STDHA), which efficiently endows image transformers with temporal modeling capability at zero extra parameters and computation. Second, to handle the domain gap between images and videos, we propose a linear adaptation strategy that uses lightweight, densely placed linear adapters to fully transfer the frozen image models to video recognition. Thanks to their customized linear design, all newly added adapters can be merged into the original modules through structural reparameterization after training, thus achieving zero extra cost during inference. Extensive experiments on four widely used video recognition benchmarks show that ZeroI2V can match or even outperform previous state-of-the-art methods while enjoying superior parameter and inference efficiency.
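The core idea behind STDHA is that some attention heads can be repurposed for temporal modeling without any new parameters or FLOPs: their keys and values are simply borrowed from neighbouring frames. The following is a minimal NumPy sketch of that idea; the head offsets, shapes, and names here are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

# Illustrative sketch (not the paper's exact design): per-head temporal
# offsets let a few heads attend to other frames by rolling their
# keys/values along the time axis. Heads with offset 0 stay spatial.
T, N, H, Dh = 4, 6, 4, 8  # frames, tokens per frame, heads, head dim
rng = np.random.default_rng(1)
q = rng.standard_normal((T, H, N, Dh))
k = rng.standard_normal((T, H, N, Dh))
v = rng.standard_normal((T, H, N, Dh))

offsets = [0, 0, 1, -1]  # hypothetical per-head temporal offsets

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

out = np.empty_like(q)
for h, dt in enumerate(offsets):
    # Temporal heads use keys/values from a neighbouring frame; this
    # adds zero parameters and zero extra computation versus plain
    # spatial attention, since only the indexing changes.
    k_h = np.roll(k[:, h], dt, axis=0)
    v_h = np.roll(v[:, h], dt, axis=0)
    attn = softmax(q[:, h] @ k_h.transpose(0, 2, 1) / np.sqrt(Dh))
    out[:, h] = attn @ v_h

print(out.shape)  # (4, 4, 6, 8)
```

Note that the attention computation itself is unchanged for every head; only the source frame of the keys/values differs, which is why the temporal capability comes at zero cost.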
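The zero inference cost of the linear adapters follows from a standard structural-reparameterization argument: a purely linear adapter applied after a frozen linear layer composes into a single linear layer. Below is a minimal NumPy sketch of that merge, assuming a residual linear adapter `out = z + A z` placed after a frozen layer `z = W x + b`; the specific placement and shapes are illustrative assumptions.

```python
import numpy as np

# Sketch of merging a residual linear adapter into a frozen linear layer.
# Frozen layer: z = W x + b.  Adapter (trained): out = z + A z.
# Then out = (I + A) W x + (I + A) b, so the adapter folds into
# W' = (I + A) W and b' = (I + A) b with zero extra inference cost.
rng = np.random.default_rng(0)
d_in, d_out = 8, 8
W = rng.standard_normal((d_out, d_in))   # frozen weight
b = rng.standard_normal(d_out)           # frozen bias
A = 0.01 * rng.standard_normal((d_out, d_out))  # trained adapter weight

def forward_with_adapter(x):
    z = W @ x + b
    return z + A @ z

# Structural reparameterization after training:
W_merged = (np.eye(d_out) + A) @ W
b_merged = (np.eye(d_out) + A) @ b

def forward_merged(x):
    return W_merged @ x + b_merged

x = rng.standard_normal(d_in)
assert np.allclose(forward_with_adapter(x), forward_merged(x))
```

Any nonlinearity inside the adapter would break this merge, which is why the adapters must stay strictly linear for the zero-cost claim to hold.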


Results from the Paper


Ranked #5 on Action Recognition on UCF101 (using extra training data)

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Action Recognition | HMDB-51 | ZeroI2V ViT-L/14 | Average accuracy of 3 splits | 83.4 | #11 |
| Action Classification | Kinetics-400 | ZeroI2V ViT-L/14 | Acc@1 | 87.2 | #31 |
| Action Classification | Kinetics-400 | ZeroI2V ViT-L/14 | Acc@5 | 97.6 | #18 |
| Action Recognition | Something-Something V2 | ZeroI2V ViT-L/14 | Top-1 Accuracy | 72.2 | #24 |
| Action Recognition | Something-Something V2 | ZeroI2V ViT-L/14 | Top-5 Accuracy | 93.0 | #19 |
| Action Recognition | UCF101 | ZeroI2V ViT-L/14 | 3-fold Accuracy | 98.6 | #5 |
