Audio-Guided Attention Network for Weakly Supervised Violence Detection

Conference 2022  ·  Yujiang Pu, Xiaoyu Wu

Detecting violence in videos is a challenging task due to complex scenarios and large intra-class variability. Most previous works focus on analyzing appearance or motion information while ignoring the co-occurrence of audio and visual events. Physical conflicts such as abuse and fighting are usually accompanied by screaming, while crowd violence such as riots and wars is generally associated with gunshots and explosions. We therefore propose a novel audio-guided multimodal violence detection framework. First, deep neural networks are used to extract appearance and audio features, respectively. Then, a Cross-Modal Awareness Local-Arousal (CMA-LA) network is proposed for cross-modal interaction, which performs audio-to-visual feature enhancement over the temporal dimension. The enhanced features are fed into a multilayer perceptron (MLP) to capture high-level semantics, followed by a temporal convolution layer to produce high-confidence violence scores. To validate the proposed method, we conduct experiments on a large-scale violent video dataset, XD-Violence. Comprehensive experiments demonstrate the robust performance of our approach, which also achieves a new state-of-the-art AP.
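The core idea of audio-to-visual enhancement over the temporal dimension can be sketched with plain cross-modal attention: audio features act as queries over the visual timeline, and the re-aggregated visual features are added back as a residual. This is a minimal illustrative sketch, not the paper's CMA-LA architecture; the projection matrices, feature dimensions, and the residual form are all assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def audio_guided_enhancement(visual, audio, d_model=64, seed=0):
    """Hypothetical sketch of audio-guided temporal enhancement.

    visual: (T, dv) per-snippet visual features
    audio:  (T, da) per-snippet audio features
    Returns visually enhanced features of shape (T, dv).
    """
    rng = np.random.default_rng(seed)
    T, dv = visual.shape
    _, da = audio.shape
    # Random projections stand in for learned weights (assumption)
    Wq = rng.standard_normal((da, d_model)) / np.sqrt(da)
    Wk = rng.standard_normal((dv, d_model)) / np.sqrt(dv)
    Q = audio @ Wq                     # (T, d_model) audio queries
    K = visual @ Wk                    # (T, d_model) visual keys
    attn = softmax(Q @ K.T / np.sqrt(d_model), axis=-1)  # (T, T)
    # Residual enhancement: audio-weighted visual aggregation
    return visual + attn @ visual

# Toy dimensions (assumed, not from the paper)
T, dv, da = 32, 1024, 128
visual = np.random.default_rng(1).standard_normal((T, dv))
audio = np.random.default_rng(2).standard_normal((T, da))
enhanced = audio_guided_enhancement(visual, audio)
print(enhanced.shape)  # (32, 1024)
```

In the paper's pipeline, the enhanced features would then pass through an MLP and a temporal convolution layer to yield per-snippet violence scores.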


Datasets


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
| --- | --- | --- | --- | --- | --- |
| Anomaly Detection In Surveillance Videos | XD-Violence | CMA_LA | AP | 83.54 | #5 |

Methods