Cross Modal Retrieval with Querybank Normalisation

Profiting from large-scale training datasets, advances in neural architecture design and efficient inference, joint embeddings have become the dominant approach for tackling cross-modal retrieval. In this work we first show that, despite their effectiveness, state-of-the-art joint embeddings suffer significantly from the longstanding "hubness problem" in which a small number of gallery embeddings form the nearest neighbours of many queries. Drawing inspiration from the NLP literature, we formulate a simple but effective framework called Querybank Normalisation (QB-Norm) that re-normalises query similarities to account for hubs in the embedding space. QB-Norm improves retrieval performance without requiring retraining. Unlike prior work, we show that QB-Norm works effectively without concurrent access to any test set queries. Within the QB-Norm framework, we also propose a novel similarity normalisation method, the Dynamic Inverted Softmax, that is significantly more robust than existing approaches. We showcase QB-Norm across a range of cross-modal retrieval models and benchmarks where it consistently enhances strong baselines beyond the state of the art. Code is available at https://vladbogo.github.io/QB-Norm/.

CVPR 2022
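The re-normalisation idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the temperature `beta`, and the toy similarity values are hypothetical, and the gating rule used for the "dynamic" variant (normalise only when the query's top match is a gallery item that some querybank entry also ranks first) is an assumption about how the Dynamic Inverted Softmax works. The linked code is the authoritative reference.

```python
import math


def inverted_softmax(query_sims, bank_sims, beta=20.0):
    """Re-normalise one query's similarities against a querybank.

    query_sims: similarities between the query and each gallery item.
    bank_sims:  one row per querybank entry, same gallery order.
    beta:       temperature (illustrative value, an assumption).
    """
    scores = []
    for j, sim in enumerate(query_sims):
        # A gallery item that is close to many querybank entries (a hub)
        # gets a large denominator, which pushes its score down.
        denom = sum(math.exp(beta * row[j]) for row in bank_sims)
        scores.append(math.exp(beta * sim) / denom)
    return scores


def dynamic_inverted_softmax(query_sims, bank_sims, beta=20.0):
    """Apply the normalisation only when the top match looks like a hub.

    Assumed gating rule: a gallery item counts as a potential hub if it is
    the top-1 match of at least one querybank entry.
    """
    hub_candidates = {
        max(range(len(row)), key=row.__getitem__) for row in bank_sims
    }
    top1 = max(range(len(query_sims)), key=query_sims.__getitem__)
    if top1 in hub_candidates:
        return inverted_softmax(query_sims, bank_sims, beta)
    return list(query_sims)  # leave the similarities untouched


# Toy data: gallery item 0 is a hub, closest to every querybank entry.
bank = [
    [0.90, 0.10, 0.20],
    [0.80, 0.30, 0.10],
    [0.85, 0.20, 0.30],
]
hub_query = [0.60, 0.55, 0.10]    # raw top-1 is the hub (item 0)
clean_query = [0.10, 0.70, 0.20]  # raw top-1 is item 1, not a hub

renormalised = dynamic_inverted_softmax(hub_query, bank)
best_after = max(range(len(renormalised)), key=renormalised.__getitem__)
unchanged = dynamic_inverted_softmax(clean_query, bank)
```

In this toy case the hub query's best match moves from item 0 to item 1 after normalisation, while the second query is returned unchanged because its top match is not in the hub candidate set. No retraining is involved: only the similarity scores are adjusted, and the querybank can be drawn from data other than the test queries.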

Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Text to Audio Retrieval | AudioCaps | QB-Norm+CE | R@1 | 23.9 | #7 |
| | | | R@10 | 71.6±0.4 | #6 |
| Video Retrieval | DiDeMo | QB-Norm+CLIP4Clip | text-to-video R@1 | 43.5 | #32 |
| | | | text-to-video R@5 | 71.4 | #29 |
| | | | text-to-video R@10 | 80.9 | #28 |
| | | | text-to-video Median Rank | 2.0 | #9 |
| Video Retrieval | LSMDC | QB-Norm+CLIP4Clip | text-to-video R@1 | 22.4 | #23 |
| | | | text-to-video R@5 | 40.1 | #20 |
| | | | text-to-video R@10 | 49.5 | #20 |
| | | | text-to-video Median Rank | 11.0 | #10 |
| Video Retrieval | MSR-VTT-1kA | QB-Norm+CLIP2Video | text-to-video R@1 | 47.2 | #27 |
| | | | text-to-video R@5 | 73.0 | #26 |
| | | | text-to-video R@10 | 83.0 | #25 |
| | | | text-to-video Median Rank | 2 | #10 |
| Video Retrieval | MSVD | QB-Norm+CLIP2Video | text-to-video R@1 | 48.0 | #14 |
| | | | text-to-video R@5 | 77.9 | #12 |
| | | | text-to-video R@10 | 86.2 | #11 |
| | | | text-to-video Median Rank | 2.0 | #8 |
| Video Retrieval | QuerYD | QB-Norm+TT-CE+ | text-to-video R@1 | 15.1 | #5 |
| Metric Learning | Stanford Online Products | QB-Norm+RDML | R@1 | 78.1 | #30 |
| Video Retrieval | VATEX | QB-Norm+CLIP2Video | text-to-video R@1 | 58.8 | #10 |
| | | | text-to-video R@10 | 93.8 | #7 |

Methods