Mining Inter-Video Proposal Relations for Video Object Detection

ECCV 2020  ·  Mingfei Han, Yali Wang, Xiaojun Chang, Yu Qiao

Recent studies have shown that aggregating context information from proposals in different frames can clearly enhance the performance of video object detection. However, these approaches mainly exploit the intra-video proposal relation within a single video, while ignoring the inter-video proposal relation among different videos, which can provide important discriminative cues for recognizing confusing objects. To address this limitation, we propose a novel Inter-Video Proposal Relation module. Based on a concise multi-level triplet selection scheme, this module learns effective object representations by modeling relations of hard proposals among different videos. Moreover, we design a Hierarchical Video Relation Network (HVR-Net) that integrates intra-video and inter-video proposal relations in a hierarchical fashion. This design progressively exploits both intra-video and inter-video contexts to boost video object detection. We evaluate our method on the large-scale video object detection benchmark ImageNet VID, where HVR-Net achieves state-of-the-art results. Code and models will be released afterwards.
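
The abstract does not spell out the triplet selection scheme, so the following is only a minimal sketch of one plausible proposal-level step, not the authors' implementation. It assumes that, for an anchor proposal, the hard positive is the least similar same-class proposal from another video and the hard negative is the most similar different-class proposal from another video. The dictionary keys (`feat`, `label`, `video`), the cosine similarity measure, and the margin value are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def select_hard_triplet(anchor, proposals):
    """Pick a hard positive and a hard negative for one anchor proposal.

    Each proposal is a dict with the hypothetical keys 'feat', 'label',
    and 'video'. Only proposals from other videos are considered.
    Returns None when no valid positive or negative exists.
    """
    others = [p for p in proposals if p["video"] != anchor["video"]]
    positives = [p for p in others if p["label"] == anchor["label"]]
    negatives = [p for p in others if p["label"] != anchor["label"]]
    if not positives or not negatives:
        return None
    # Hard positive: same class, but least similar to the anchor.
    hard_pos = min(positives, key=lambda p: cosine(anchor["feat"], p["feat"]))
    # Hard negative: different class, but most similar to the anchor.
    hard_neg = max(negatives, key=lambda p: cosine(anchor["feat"], p["feat"]))
    return hard_pos, hard_neg


def triplet_loss(anchor, pos, neg, margin=0.2):
    """Margin loss on cosine distance (1 - similarity); margin is assumed."""
    d_pos = 1.0 - cosine(anchor["feat"], pos["feat"])
    d_neg = 1.0 - cosine(anchor["feat"], neg["feat"])
    return max(0.0, d_pos - d_neg + margin)


# Toy data: one anchor in video v1, candidates in v1, v2 and v3.
anchor = {"feat": [1.0, 0.0], "label": "cat", "video": "v1"}
proposals = [
    {"feat": [0.9, 0.1], "label": "cat", "video": "v2"},   # easy positive
    {"feat": [0.0, 1.0], "label": "cat", "video": "v2"},   # hard positive
    {"feat": [0.8, 0.6], "label": "dog", "video": "v3"},   # hard negative
    {"feat": [-1.0, 0.0], "label": "dog", "video": "v3"},  # easy negative
    {"feat": [0.0, 1.0], "label": "cat", "video": "v1"},   # same video, skipped
]

hard_pos, hard_neg = select_hard_triplet(anchor, proposals)
print(hard_pos["feat"], hard_neg["feat"], triplet_loss(anchor, hard_pos, hard_neg))
```

In this toy setting the loss pulls the dissimilar same-class proposal toward the anchor and pushes the look-alike proposal of another class away, which is the kind of discriminative cue the abstract attributes to inter-video relations. The "multi-level" part of the scheme (how videos themselves are chosen before proposals are compared) is not described in the abstract and is not modeled here.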

Results from the Paper


Task                    Dataset       Model                      Metric  Value  Global Rank
Video Object Detection  ImageNet VID  HVRNet (ResNest101)        mAP     83.8   # 15
Video Object Detection  ImageNet VID  HVRNet (ResNeXt101-32x4d)  mAP     85.5   # 8
