Improve Temporal Action Proposals using Hierarchical Context

Temporal action proposal (TAP) aims to generate accurate candidate action instances in an untrimmed video. Prior work has shown that context is critically important to this task. In this paper, we propose a novel hierarchical context network (HCN) to further explore snippet-level and proposal-level contexts, which are used to improve the representations of snippets and proposals, respectively. First, we observe that different scales of snippet-level context are not equally important for different action instances. We therefore incorporate a novel gating mechanism into the U-Net structure to capture content-adaptive snippet-level contexts. Second, to exploit proposal-level contexts, we propose an efficient task-specific self-attention model. By stacking multiple attention models, we can explore proposal-level contexts deeply and over a wide range. Finally, to leverage both levels of context, we equip HCN with three branches that evaluate proposals from local to global perspectives. Our experiments on the ActivityNet-1.3 and THUMOS14 datasets show that HCN significantly outperforms previous TAP methods. Further experiments demonstrate that our method substantially improves state-of-the-art action detection performance when combined with existing action classifiers.
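
To make the idea of content-adaptive gating over snippet-level scales concrete, the following is a minimal sketch, not the HCN architecture itself. It assumes each snippet already has one feature vector per temporal scale, and that each scale has a gate weight vector; the names `gated_fusion`, `scale_features`, and `gate_weights` are hypothetical and do not come from the paper. The only point illustrated is that the gate for each scale is computed from the feature content, so different snippets can weight scales differently.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(scale_features, gate_weights):
    """Fuse per-scale feature vectors for a single snippet.

    scale_features: list of S vectors (one per temporal scale), each of length D.
    gate_weights:   list of S weight vectors of length D. These are hypothetical
                    stand-ins for parameters a real model would learn.
    Returns the fused vector and the gate value used for each scale.
    """
    dim = len(scale_features[0])
    fused = [0.0] * dim
    gates = []
    for feat, weights in zip(scale_features, gate_weights):
        # The gate is a function of the feature itself (content-adaptive).
        gate = sigmoid(sum(f * w for f, w in zip(feat, weights)))
        gates.append(gate)
        for d in range(dim):
            fused[d] += gate * feat[d]
    return fused, gates


# Two scales, two-dimensional features; zero weights give a neutral gate of 0.5.
fused, gates = gated_fusion([[1.0, 0.0], [0.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]])
```

With non-zero weights, a scale whose features align with its weight vector receives a gate close to 1 and dominates the fused representation, while other scales are suppressed.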

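Task                                 Dataset          Model               Metric Name   Metric Value  Global Rank
Temporal Action Localization         ActivityNet-1.3  HCN (I3D features)  mAP IOU@0.5   52.51         #14
Temporal Action Localization         ActivityNet-1.3  HCN (I3D features)  mAP           35.61         #18
Temporal Action Localization         ActivityNet-1.3  HCN (I3D features)  mAP IOU@0.75  36.10         #10
Temporal Action Localization         ActivityNet-1.3  HCN (I3D features)  mAP IOU@0.95  7.12          #21
Temporal Action Proposal Generation  ActivityNet-1.3  HCN                 AUC (val)     68.78         #5
Temporal Action Proposal Generation  ActivityNet-1.3  HCN                 AR@100        77.13         #3
Temporal Action Proposal Generation  THUMOS14         HCN                 AR@100        50.86         #1
Temporal Action Proposal Generation  THUMOS14         HCN                 AR@1000       67.34         #2
Temporal Action Proposal Generation  THUMOS14         HCN                 AR@200        57.56         #1
Temporal Action Proposal Generation  THUMOS14         HCN                 AR@50         64.28         #1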
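
The AR@N entries above report average recall when only the top N proposals are kept. As a rough illustration of how such a number is computed, here is a minimal single-video sketch. It assumes proposals are `(start, end, score)` tuples and that the temporal IoU thresholds are supplied by the caller; the function names are hypothetical, and official benchmark toolkits aggregate over all videos and differ in details.

```python
def temporal_iou(a, b):
    """Temporal IoU of two segments given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_n(ground_truth, proposals, n, tiou_threshold):
    """Fraction of ground-truth segments covered by any of the top-n proposals."""
    # Keep the n highest-scoring proposals.
    top = sorted(proposals, key=lambda p: p[2], reverse=True)[:n]
    hits = sum(
        1
        for gt in ground_truth
        if any(temporal_iou(gt, (s, e)) >= tiou_threshold for s, e, _ in top)
    )
    return hits / len(ground_truth) if ground_truth else 0.0


def average_recall_at_n(ground_truth, proposals, n, thresholds):
    """Mean of recall_at_n over a list of temporal IoU thresholds."""
    return sum(
        recall_at_n(ground_truth, proposals, n, t) for t in thresholds
    ) / len(thresholds)


# Toy example: two ground-truth actions and three scored proposals.
gt = [(0.0, 10.0), (20.0, 30.0)]
props = [(0.0, 10.0, 0.9), (21.0, 30.0, 0.8), (50.0, 60.0, 0.7)]
ar_at_3 = average_recall_at_n(gt, props, n=3, thresholds=[0.5, 0.95])
```

In this toy case the second proposal overlaps its ground-truth segment with a temporal IoU of 0.9, so it counts as a hit at the 0.5 threshold but not at 0.95.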
