Token-Level Contrast for Video and Language Alignment

1 Jan 2021  ·  Jianwei Yang, Yonatan Bisk, Jianfeng Gao

Building video and language understanding models requires grounding linguistic concepts and video content in a shared space. Most previous works learn a holistic alignment between the two modalities while neglecting token-level grounding. Masked token prediction can be used to learn token-level multi-modal representations, but it does not necessarily force lexical grounding on perception and it also introduces a domain shift between pretraining and fine-tuning. This paper introduces a simple token-level contrastive loss (ToCo), informed by syntactic classes (e.g., nouns and verbs), that forces the model to prioritize grounding concrete, semantic-bearing words. ToCo does not mask inputs; instead, it exerts both local (contextual token) and global (lexical type) pressures for multi-modal alignment in a contrastive manner. Our approach enables a simple vanilla BERT-based multimodal transformer to compete with or outperform existing heavily engineered multi-loss or large models on three benchmarks (YouCook2, MSR-VTT, and CrossTask). Further, it is plug-and-play: it yields gains whether applied in pretraining alone or in downstream tasks alone, regardless of the underlying visual or textual feature representations.
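
As a rough illustration of what a token-level contrastive objective can look like, the sketch below implements an InfoNCE-style loss computed only over semantic-bearing text tokens (e.g., nouns and verbs) matched against video tokens. The function name, tensor shapes, temperature, and the max-over-video-tokens matching are assumptions made for illustration; the paper's exact ToCo formulation (its local contextual-token and global lexical-type terms) is not reproduced here.

```python
# Illustrative sketch only, assuming precomputed token embeddings from a
# multimodal transformer; not the paper's exact ToCo loss.
import torch
import torch.nn.functional as F

def token_contrastive_loss(text_tokens, video_tokens, content_mask, temperature=0.07):
    """
    text_tokens:  (B, Lt, D) contextual text token embeddings
    video_tokens: (B, Lv, D) video token embeddings
    content_mask: (B, Lt) bool mask selecting semantic-bearing tokens (e.g., nouns, verbs)
    """
    text_tokens = F.normalize(text_tokens, dim=-1)
    video_tokens = F.normalize(video_tokens, dim=-1)

    # Similarity of every text token to every video token of every clip in the batch.
    sim = torch.einsum('btd,cvd->btcv', text_tokens, video_tokens)  # (B, Lt, B, Lv)

    # For each text token, keep its best-matching video token per clip; the paired
    # clip (same batch index) is the positive, all other clips are negatives.
    sim = sim.max(dim=-1).values / temperature                      # (B, Lt, B)

    B, Lt = sim.size(0), sim.size(1)
    targets = torch.arange(B, device=sim.device)                    # positive = same index
    loss = F.cross_entropy(
        sim[content_mask],                                          # only nouns/verbs contribute
        targets.unsqueeze(1).expand(B, Lt)[content_mask],
    )
    return loss
```

In this sketch, restricting the loss to `content_mask` is what plays the role of the syntactic-class prior: function words are excluded from the contrastive objective, so the gradient pressure concentrates on grounding nouns and verbs against the video tokens.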
