Learning Better Visual Representations for Weakly-Supervised Object Detection Using Natural Language Supervision

29 Sep 2021 · Mesut Erhan Unal, Adriana Kovashka

We present a framework to better leverage natural language supervision for a specific downstream task: weakly-supervised object detection (WSOD). Our framework employs a multimodal pre-training step during which region-level groundings are learned in a weakly-supervised manner and later maintained for the downstream task. Further, to appropriately use the noisy supervision that captions provide for object detection, we use coherence analysis and other cross-modal alignment metrics to weight image-caption pairs during WSOD training. Results indicate that WSOD can better leverage representation learning by (1) learning a region-based alignment between image regions and caption tokens, (2) enforcing that the visual backbone does not forget this alignment during the downstream WSOD task, and (3) suppressing instances with weak image-caption correspondence during the WSOD training stage.
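The instance-suppression idea in (3) can be sketched as follows. This is a minimal illustration, not the paper's actual coherence-analysis pipeline: the cosine-similarity scoring, the softmax weighting, and the function names are all illustrative assumptions. The sketch assumes pooled image and caption embeddings are already available and down-weights the WSOD loss of pairs whose caption poorly matches the image.

```python
import numpy as np

def coherence_weights(image_embs, caption_embs, tau=1.0):
    """Per-pair weights from image-caption cosine similarity.

    Pairs whose caption poorly matches the image receive lower weight,
    suppressing noisy supervision during WSOD training.
    """
    # L2-normalize embeddings so the dot product equals cosine similarity.
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    cap = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    sim = np.sum(img * cap, axis=1)  # one alignment score per pair

    # Softmax over the batch turns raw scores into relative weights;
    # tau controls how sharply weak pairs are suppressed.
    w = np.exp(sim / tau)
    return w / w.sum()

def weighted_wsod_loss(per_pair_losses, weights):
    # Aggregate per-pair losses, down-weighting weakly aligned pairs.
    return float(np.sum(weights * per_pair_losses))
```

With a batch of two pairs where only the first caption matches its image, `coherence_weights` assigns the first pair the larger weight, so its loss dominates the weighted sum.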
