BUS: Efficient and Effective Vision-Language Pre-Training with Bottom-Up Patch Summarization.

Vision Transformer (ViT) based Vision-Language Pre-training (VLP) models have recently demonstrated impressive performance on various tasks. However, the lengthy visual token sequences these models process can make them both inefficient and less effective. Existing methods that address this issue lack textual guidance and may discard visual information that is crucial to the text, introducing irrelevant content during cross-modal fusion and incurring additional computational cost. In this paper, we propose a Bottom-Up Patch Summarization approach, named BUS, inspired by the document summarization task in NLP, which learns a concise visual summary of the lengthy visual token sequence under the guidance of textual semantics. We introduce a Text-Semantic Aware Patch Selector (TAPS) into the ViT backbone to perform coarse-grained selective visual summarization, over-selecting the text-relevant patches, and a light Summarization Decoder to perform fine-grained abstractive summarization over the selected patches, yielding a further condensed representation sequence that highlights text-relevant visual semantic information. This bottom-up process is both efficient and effective, leading to higher performance. We evaluate our approach on various vision-language understanding and generation tasks and show competitive or better downstream task performance while improving efficiency by 50%. Additionally, our model achieves state-of-the-art downstream task performance when the input image resolution is increased, without increasing computational cost compared to baselines.
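
The sketch below illustrates the two-stage bottom-up summarization described above, but it is not the authors' implementation: a text-aware selector keeps the patches most relevant to a pooled text representation (coarse, extractive stage), and a light decoder with learned summary queries cross-attends to the kept patches to produce a much shorter summary sequence (fine, abstractive stage). Module names, dimensions, the keep ratio, and the number of summary tokens are illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's code) of bottom-up patch summarization:
# (1) text-guided top-k patch selection, (2) query-based abstractive compression.
import torch
import torch.nn as nn


class TextAwarePatchSelector(nn.Module):
    """Coarse-grained selection: score patches against pooled text semantics."""

    def __init__(self, dim: int, keep_ratio: float = 0.5):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.patch_proj = nn.Linear(dim, dim)
        self.text_proj = nn.Linear(dim, dim)

    def forward(self, patches: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # patches: (B, N, D) visual patch tokens; text: (B, L, D) text tokens
        text_query = self.text_proj(text.mean(dim=1))             # (B, D) pooled text
        scores = torch.einsum("bnd,bd->bn", self.patch_proj(patches), text_query)
        k = max(1, int(patches.size(1) * self.keep_ratio))
        topk = scores.topk(k, dim=1).indices                      # (B, k) kept indices
        idx = topk.unsqueeze(-1).expand(-1, -1, patches.size(-1))
        return patches.gather(1, idx)                             # (B, k, D) kept patches


class SummarizationDecoder(nn.Module):
    """Fine-grained abstraction: learned summary queries cross-attend to kept patches."""

    def __init__(self, dim: int, num_summary_tokens: int = 16, num_heads: int = 8):
        super().__init__()
        self.summary = nn.Parameter(torch.randn(num_summary_tokens, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim * 4),
                                 nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, kept_patches: torch.Tensor) -> torch.Tensor:
        q = self.summary.unsqueeze(0).expand(kept_patches.size(0), -1, -1)
        out, _ = self.cross_attn(q, kept_patches, kept_patches)   # (B, S, D)
        return out + self.ffn(out)                                # condensed visual summary


if __name__ == "__main__":
    B, N, L, D = 2, 196, 20, 256
    patches, text = torch.randn(B, N, D), torch.randn(B, L, D)
    selector, decoder = TextAwarePatchSelector(D), SummarizationDecoder(D)
    summary = decoder(selector(patches, text))
    print(summary.shape)  # torch.Size([2, 16, 256]) -- far shorter than 196 patches
```

The short summary sequence, rather than the full patch sequence, would then be fed to cross-modal fusion, which is where the efficiency gain in the abstract comes from.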
