MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models

Foundation models have shown outstanding performance and generalization capabilities across domains. Since most studies on foundation models mainly focus on the pretraining phase, a naive strategy of minimizing a single task-specific loss is adopted for fine-tuning. However, such fine-tuning methods do not fully leverage other losses that are potentially beneficial for the target task. Therefore, we propose MEta Loss TRansformer (MELTR), a plug-in module that automatically and non-linearly combines various loss functions to aid learning of the target task via auxiliary learning. We formulate this auxiliary learning as a bi-level optimization problem and present an efficient optimization algorithm based on Approximate Implicit Differentiation (AID). For evaluation, we apply our framework to various video foundation models (UniVL, VIOLET, and All-in-one), and show significant performance gains on all four downstream tasks: text-to-video retrieval, video question answering, video captioning, and multi-modal sentiment analysis. Our qualitative analyses demonstrate that MELTR adequately 'transforms' individual loss functions and 'melts' them into an effective unified loss. Code is available at https://github.com/mlvlab/MELTR.
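As a rough illustration of the core idea (not the authors' implementation), the sketch below combines several task-loss values with a tiny single-head self-attention block, so that the unified loss is a learned non-linear function of the individual losses. All names, dimensions, and the random initialization are illustrative; in MELTR the combiner's parameters are meta-learned on a validation objective via AID rather than fixed.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def meltr_combine(losses, d=8, seed=0):
    """Toy MELTR-style combiner: embed each scalar loss as a token,
    run one self-attention layer over the loss tokens, and pool to a
    single scalar 'unified' loss. Weights are random stand-ins here;
    in the paper they would be meta-learned (bi-level optimization)."""
    rng = np.random.default_rng(seed)
    n = len(losses)
    # Embed each scalar loss value into a d-dimensional token.
    W_emb = rng.normal(scale=0.1, size=(1, d))
    tokens = np.asarray(losses, dtype=float).reshape(n, 1) @ W_emb  # (n, d)
    # Single-head self-attention over the n loss tokens.
    Wq = rng.normal(scale=0.1, size=(d, d))
    Wk = rng.normal(scale=0.1, size=(d, d))
    Wv = rng.normal(scale=0.1, size=(d, d))
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d))                            # (n, n)
    mixed = attn @ V                                                # (n, d)
    # Pool the tokens and project to one scalar: the unified loss.
    w_out = rng.normal(scale=0.1, size=(d,))
    return float(mixed.mean(axis=0) @ w_out)

# e.g. captioning, retrieval, and masked-language-modeling losses
unified = meltr_combine([0.9, 1.3, 0.4])
print(unified)
```

Because the combination is attention-based rather than a fixed weighted sum, the contribution of each auxiliary loss can depend non-linearly on the values of the others, which is the property the paper exploits.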

Published at CVPR 2023.

Results from the Paper


| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Multimodal Sentiment Analysis | CMU-MOSI | UniVL + MELTR | F1 | 85.4 | #2 |
| | | | MAE | 0.759 | #8 |
| | | | Corr | 0.789 | #6 |
| | | | Acc-2 | 85.3 | #4 |
| Video Retrieval | MSR-VTT | All-in-one + MELTR | text-to-video R@1 | 38.6 | #14 |
| | | | text-to-video R@5 | 74.4 | #7 |
| | | | text-to-video R@10 | 84.7 | #6 |
| Video Captioning | MSR-VTT | UniVL + MELTR | CIDEr | 52.77 | #22 |
| | | | METEOR | 29.26 | #16 |
| | | | ROUGE-L | 62.35 | #17 |
| | | | BLEU-4 | 44.17 | #18 |
| Video Retrieval | MSR-VTT | UniVL + MELTR | text-to-video R@1 | 28.5 | #27 |
| | | | text-to-video R@5 | 55.5 | #22 |
| | | | text-to-video R@10 | 67.6 | #20 |
| | | | text-to-video Median Rank | 4 | #7 |
| Video Retrieval | MSR-VTT | VIOLET + MELTR | text-to-video R@1 | 33.6 | #19 |
| | | | text-to-video R@5 | 63.7 | #14 |
| | | | text-to-video R@10 | 77.8 | #13 |
| | | | text-to-video Median Rank | 3 | #1 |
| Video Retrieval | MSR-VTT-1kA | VIOLET + MELTR | text-to-video R@1 | 35.5 | #43 |
| | | | text-to-video R@5 | 67.2 | #38 |
| | | | text-to-video R@10 | 78.4 | #39 |
| | | | text-to-video Median Rank | 3 | #24 |
| Video Retrieval | MSR-VTT-1kA | All-in-one + MELTR | text-to-video R@1 | 41.3 | #35 |
| | | | text-to-video R@5 | 73.5 | #24 |
| | | | text-to-video R@10 | 82.5 | #27 |
| Video Retrieval | MSR-VTT-1kA | UniVL + MELTR | text-to-video R@1 | 31.1 | #45 |
| | | | text-to-video R@5 | 55.7 | #46 |
| | | | text-to-video R@10 | 68.3 | #49 |
| | | | text-to-video Median Rank | 4 | #28 |
| Visual Question Answering (VQA) | MSVD-QA | VIOLET + MELTR | Accuracy | 0.517 | #20 |
| TGIF-Transition | TGIF-QA | VIOLET + MELTR | Accuracy | 97.5 | #5 |
| TGIF-Frame | TGIF-QA | VIOLET + MELTR | Accuracy | 63.4 | #16 |
| TGIF-Action | TGIF-QA | VIOLET + MELTR | Accuracy | 95.4 | #3 |
| Video Captioning | YouCook2 | UniVL + MELTR | BLEU-3 | 24.12 | #1 |
| | | | BLEU-4 | 17.92 | #2 |
| | | | METEOR | 22.56 | #1 |
| | | | ROUGE-L | 47.04 | #1 |
| | | | CIDEr | 1.90 | #2 |
| Video Retrieval | YouCook2 | UniVL + MELTR | text-to-video Median Rank | 3 | #1 |
| | | | text-to-video R@1 | 33.7 | #2 |
| | | | text-to-video R@10 | 74.8 | #3 |
| | | | text-to-video R@5 | 63.1 | #3 |
