Measuring Progress in Fine-grained Vision-and-Language Understanding

While pretraining on large-scale image-text data from the Web has facilitated rapid progress on many vision-and-language (V&L) tasks, recent work has demonstrated that pretrained models lack "fine-grained" understanding, such as the ability to recognise relationships, verbs, and numbers in images. This has spurred increased interest in the community in developing new benchmarks or models for such capabilities. To better understand and quantify progress in this direction, we investigate four competitive V&L models on four fine-grained benchmarks. Through our analysis, we find that X-VLM (Zeng et al., 2022) consistently outperforms the other baselines, and that modelling innovations can impact performance more than scaling Web data, which sometimes even degrades performance. Through a deeper investigation of X-VLM, we highlight the importance of both novel losses and rich data sources for learning fine-grained skills. Finally, we inspect training dynamics and discover that, for some tasks, performance peaks early in training or fluctuates significantly, never converging.

Results from the Paper


Task: Visual Reasoning, Dataset: Winoground
(each cell shows the score, with the model's global leaderboard rank in parentheses)

Model                    Text Score    Image Score   Group Score
X-VLM 16M                46.7 (#13)    24.5 (#38)    21.2 (#27)
BLIP-ViT/L 129M          34.7 (#53)    14.5 (#71)    12.2 (#57)
BLIP 129M (CapFilt/L)    34.7 (#53)    15.2 (#69)    12.2 (#57)
BLIP 129M                35.5 (#50)    15.0 (#70)    11.7 (#61)
PEVL 14M                 33.2 (#56)    15.7 (#66)    12.2 (#57)
BLIP 14M                 36.5 (#45)    18.5 (#55)    14.5 (#46)
ALBEF 14M                32.5 (#57)    16.2 (#62)    12.7 (#53)
X-VLM 4M                 44.0 (#20)    26.7 (#27)    21.5 (#25)
ALBEF 4M                 29.2 (#73)    15.5 (#67)    11.0 (#64)

Methods


No methods listed for this paper.