Revealing Single Frame Bias for Video-and-Language Learning

7 Jun 2022 · Jie Lei, Tamara L. Berg, Mohit Bansal

Training an effective video-and-language model intuitively requires multiple frames as model inputs. However, it is unclear whether using multiple frames is beneficial to downstream tasks, and if so, whether the performance gain is worth the drastically increased computation and memory costs of using more frames. In this work, we explore single-frame models for video-and-language learning. On a diverse set of video-and-language tasks (including text-to-video retrieval and video question answering), we show the surprising result that, with large-scale pre-training and a proper frame ensemble strategy at inference time, a single-frame trained model that does not consider temporal information can achieve better performance than existing methods that use multiple frames for training. This result reveals the existence of a strong "static appearance bias" in popular video-and-language datasets. Therefore, to allow for a more comprehensive evaluation of video-and-language models, we propose two new retrieval tasks based on existing fine-grained action recognition datasets that encourage temporal modeling. Our code is available at https://github.com/jayleicn/singularity
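The frame ensemble strategy mentioned in the abstract is the key inference-time ingredient: the model is trained on single frames, and at test time several uniformly sampled frames are scored independently and their text-video similarities are pooled. The sketch below is a minimal illustration of this idea, not the authors' implementation; `encode_frame` and `text_emb` are hypothetical stand-ins for the paper's vision and text encoders, and the aggregation actually used in the repository may differ.

```python
# Minimal sketch of inference-time frame ensembling for a single-frame model.
# Assumptions (not taken from the paper's code): `encode_frame` maps one frame
# to an L2-normalized embedding, `text_emb` is an L2-normalized text embedding.
import torch

def ensemble_similarity(frames, text_emb, encode_frame, agg="mean"):
    """frames: tensor of shape (K, C, H, W), K uniformly sampled test frames.
    Returns one text-video similarity score pooled over the K frames."""
    sims = torch.stack([encode_frame(f) @ text_emb for f in frames])  # (K,)
    if agg == "mean":      # average the per-frame scores
        return sims.mean()
    if agg == "max":       # keep only the best-matching frame
        return sims.max()
    if agg == "lse":       # LogSumExp, a smooth version of max pooling
        return torch.logsumexp(sims, dim=0)
    raise ValueError(f"unknown aggregation: {agg}")
```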


Results from the Paper


Ranked #5 on Video Retrieval on SSv2-template retrieval (using extra training data)

Task | Dataset | Model | Metric | Value | Global Rank
Zero-Shot Video Retrieval | ActivityNet | Singularity-temporal-17M | text-to-video R@1 | 30.6 | #11
Zero-Shot Video Retrieval | ActivityNet | Singularity-temporal-17M | text-to-video R@5 | 55.6 | #10
Zero-Shot Video Retrieval | ActivityNet | Singularity-temporal-17M | text-to-video R@10 | 66.9 | #9
Video Retrieval | ActivityNet | Singularity | text-to-video R@1 | 47.1 | #18
Video Retrieval | ActivityNet | Singularity | text-to-video R@5 | 75.5 | #16
Video Retrieval | ActivityNet | Singularity | text-to-video R@10 | 85.5 | #16
Zero-Shot Video Retrieval | ActivityNet | Singularity-temporal-5M | text-to-video R@1 | 30.8 | #9
Zero-Shot Video Retrieval | ActivityNet | Singularity-temporal-5M | text-to-video R@5 | 55.9 | #9
Zero-Shot Video Retrieval | ActivityNet | Singularity-temporal-5M | text-to-video R@10 | 66.3 | #10
Video Question Answering | ActivityNet-QA | Singularity-temporal | Accuracy | 44.1 | #20
Video Question Answering | ActivityNet-QA | Singularity | Accuracy | 43.1 | #22
Video Retrieval | DiDeMo | Singularity | text-to-video R@1 | 53.9 | #17
Video Retrieval | DiDeMo | Singularity | text-to-video R@5 | 79.4 | #13
Video Retrieval | DiDeMo | Singularity | text-to-video R@10 | 86.9 | #13
Zero-Shot Video Retrieval | DiDeMo | Singularity-17M | text-to-video R@1 | 37.1 | #10
Zero-Shot Video Retrieval | DiDeMo | Singularity-17M | text-to-video R@5 | 61.7 | #11
Zero-Shot Video Retrieval | DiDeMo | Singularity-17M | text-to-video R@10 | 69.9 | #12
Zero-Shot Video Retrieval | DiDeMo | Singularity-5M | text-to-video R@1 | 36.9 | #11
Zero-Shot Video Retrieval | DiDeMo | Singularity-5M | text-to-video R@5 | 61.1 | #12
Zero-Shot Video Retrieval | DiDeMo | Singularity-5M | text-to-video R@10 | 69.3 | #13
Zero-Shot Video Retrieval | MSR-VTT | Singularity-5M | text-to-video R@1 | 28.4 | #19
Zero-Shot Video Retrieval | MSR-VTT | Singularity-5M | text-to-video R@5 | 50.2 | #18
Zero-Shot Video Retrieval | MSR-VTT | Singularity-5M | text-to-video R@10 | 59.5 | #20
Zero-Shot Video Retrieval | MSR-VTT | Singularity-17M | text-to-video R@1 | 34.0 | #15
Zero-Shot Video Retrieval | MSR-VTT | Singularity-17M | text-to-video R@5 | 56.7 | #15
Zero-Shot Video Retrieval | MSR-VTT | Singularity-17M | text-to-video R@10 | 66.7 | #14
Video Retrieval | MSR-VTT-1kA | Singularity | text-to-video R@1 | 41.5 | #34
Video Retrieval | MSR-VTT-1kA | Singularity | text-to-video R@5 | 68.7 | #36
Video Retrieval | MSR-VTT-1kA | Singularity | text-to-video R@10 | 77 | #41
Video Question Answering | MSRVTT-MC | Singularity-temporal | Accuracy | 93.7 | #5
Video Question Answering | MSRVTT-MC | Singularity | Accuracy | 92.1 | #7
Video Question Answering | MSRVTT-QA | Singularity-temporal | Accuracy | 43.9 | #12
Video Question Answering | MSRVTT-QA | Singularity | Accuracy | 43.5 | #13
Video Retrieval | SSv2-label retrieval | Singularity-temporal | text-to-video R@1 | 47.4 | #5
Video Retrieval | SSv2-label retrieval | Singularity-temporal | text-to-video R@5 | 75.9 | #5
Video Retrieval | SSv2-label retrieval | Singularity-temporal | text-to-video R@10 | 84 | #3
Video Retrieval | SSv2-template retrieval | Singularity-temporal | text-to-video R@1 | 77.6 | #5
Video Retrieval | SSv2-template retrieval | Singularity-temporal | text-to-video R@5 | 96 | #5
Video Retrieval | SSv2-template retrieval | Singularity-temporal | text-to-video R@10 | 98.9 | #5
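For reference, the text-to-video R@K values in the table are standard recall-at-K scores: each text query ranks all candidate videos by similarity, and R@K is the percentage of queries whose ground-truth video appears in the top K results. The snippet below is a small, self-contained illustration of the metric, not the paper's evaluation code.

```python
# Illustration of text-to-video Recall@K; assumes text i is paired with
# video i, the usual setup for these retrieval benchmarks.
import numpy as np

def recall_at_k(sim_matrix: np.ndarray, k: int) -> float:
    """sim_matrix: (num_texts, num_videos) text-video similarity scores."""
    ranked = (-sim_matrix).argsort(axis=1)        # videos sorted best-first per text
    gt = np.arange(sim_matrix.shape[0])[:, None]  # ground-truth video index per text
    hits = (ranked[:, :k] == gt).any(axis=1)      # is the ground truth in the top k?
    return 100.0 * hits.mean()
```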
