Multi-region two-stream R-CNN for action detection

We propose a multi-region two-stream R-CNN model for action detection in realistic videos. We start from frame-level action detection based on Faster R-CNN [1], and make three contributions: (1) we show that a motion region proposal network generates high-quality proposals, which are complementary to those of an appearance region proposal network; (2) we show that stacking optical flow over several frames significantly improves frame-level action detection; and (3) we embed a multi-region scheme in the Faster R-CNN model, which adds complementary information on body parts. We then link frame-level detections with the Viterbi algorithm, and temporally localize an action with the maximum subarray method. Experimental results on the UCF-Sports, J-HMDB and UCF101 action detection datasets show that our approach outperforms the state of the art by a significant margin in both frame-mAP and video-mAP.
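
The linking and temporal localization steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that every frame has at least one detection, that a link is scored as the sum of detection scores plus a weighted IoU between consecutive boxes, and that per-frame scores passed to the maximum subarray step have already been shifted so that background frames are negative. The names `link_detections`, `max_subarray`, and `overlap_weight` are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def link_detections(boxes: Sequence[Sequence[Box]],
                    scores: Sequence[Sequence[float]],
                    overlap_weight: float = 1.0) -> List[int]:
    """Pick one detection index per frame with Viterbi dynamic programming.

    Assumed objective: sum of detection scores plus overlap_weight times
    the IoU between the boxes chosen in consecutive frames.
    """
    best = [list(scores[0])]  # best[t][j]: best path value ending at box j of frame t
    back = []                 # back[t-1][j]: best predecessor index in frame t-1
    for t in range(1, len(boxes)):
        cur, ptr = [], []
        for j, box in enumerate(boxes[t]):
            cands = [best[-1][i] + overlap_weight * iou(prev, box)
                     for i, prev in enumerate(boxes[t - 1])]
            i_best = max(range(len(cands)), key=cands.__getitem__)
            cur.append(cands[i_best] + scores[t][j])
            ptr.append(i_best)
        best.append(cur)
        back.append(ptr)
    # Backtrack from the highest-valued detection in the last frame.
    j = max(range(len(best[-1])), key=best[-1].__getitem__)
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    return path[::-1]


def max_subarray(values: Sequence[float]) -> Tuple[int, int, float]:
    """Kadane's algorithm: (start, end_inclusive, sum) of the best contiguous run."""
    best_sum, best_start, best_end = values[0], 0, 0
    run_sum, run_start = values[0], 0
    for t in range(1, len(values)):
        if run_sum < 0:
            run_sum, run_start = values[t], t  # restart the run at frame t
        else:
            run_sum += values[t]
        if run_sum > best_sum:
            best_sum, best_start, best_end = run_sum, run_start, t
    return best_start, best_end, best_sum
```

In this sketch, `link_detections` would be run once per action class to build a spatial tube, and `max_subarray` would then be applied to the shifted scores along that tube to select the temporal extent of the action.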


Results from the Paper


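| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Skeleton Based Action Recognition | J-HMDB | MR Two-Stream R-CNN | Accuracy (RGB+pose) | 71.1 | # 7 |
| Action Recognition | UCF101 | MR Two-Stream R-CNN | 3-fold Accuracy | 91.1 | # 68 |
| Action Detection | UCF101-24 | TS R-CNN | Frame-mAP 0.5 | 39.94 | # 11 |
| Action Detection | UCF101-24 | MR-TS R-CNN | Frame-mAP 0.5 | 39.63 | # 12 |
| Action Detection | UCF Sports | TS R-CNN | Video-mAP 0.2 | 94.82 | # 2 |
| Action Detection | UCF Sports | TS R-CNN | Video-mAP 0.5 | 94.82 | # 2 |
| Action Detection | UCF Sports | TS R-CNN | Frame-mAP 0.5 | 82.30 | # 3 |
| Action Detection | UCF Sports | MR-TS R-CNN | Video-mAP 0.2 | 94.83 | # 1 |
| Action Detection | UCF Sports | MR-TS R-CNN | Video-mAP 0.5 | 94.67 | # 3 |
| Action Detection | UCF Sports | MR-TS R-CNN | Frame-mAP 0.5 | 84.52 | # 2 |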

Results from Other Papers


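| Task | Dataset | Model | Metric Name | Metric Value | Rank |
|---|---|---|---|---|---|
| Action Detection | J-HMDB | MR-TS R-CNN | Video-mAP 0.2 | 74.3 | # 9 |
| Action Detection | J-HMDB | MR-TS R-CNN | Video-mAP 0.5 | 73.09 | # 12 |
| Action Detection | J-HMDB | MR-TS R-CNN | Frame-mAP 0.5 | 58.5 | # 9 |
| Action Detection | J-HMDB | TS R-CNN | Video-mAP 0.2 | 71.1 | # 11 |
| Action Detection | J-HMDB | TS R-CNN | Video-mAP 0.5 | 70.6 | # 13 |
| Action Detection | J-HMDB | TS R-CNN | Frame-mAP 0.5 | 56.9 | # 10 |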

Methods