Multi-region two-stream R-CNN for action detection
We propose a multi-region two-stream R-CNN model for action detection in realistic videos. We start from frame-level action detection based on faster R-CNN [1], and make three contributions: (1) we show that a motion region proposal network generates high-quality proposals, which are complementary to those of an appearance region proposal network; (2) we show that stacking optical flow over several frames significantly improves frame-level action detection; and (3) we embed a multi-region scheme in the faster R-CNN model, which adds complementary information on body parts. We then link frame-level detections with the Viterbi algorithm, and temporally localize an action with the maximum subarray method. Experimental results on the UCF-Sports, J-HMDB and UCF101 action detection datasets show that our approach outperforms the state of the art by a significant margin in both frame-mAP and video-mAP.
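The linking and temporal localization steps mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The pairwise linking score (detection score plus a weighted IoU between consecutive boxes), the `overlap_weight` parameter, and the function names are assumptions made for illustration, and the per-frame scores passed to `max_subarray` are assumed to be already offset so that background frames are negative.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def link_detections(frames, overlap_weight=1.0):
    """Viterbi-style linking: pick one detection per frame so that the sum of
    detection scores and weighted overlaps between consecutive boxes is maximal.

    frames: list (one entry per frame) of lists of (box, score).
    Returns the index of the chosen detection in each frame.
    The scoring function is an assumption, not taken from the paper.
    """
    if not frames or any(len(f) == 0 for f in frames):
        raise ValueError("every frame needs at least one detection")
    # best[t][j]: best accumulated score of a path ending at detection j of frame t
    best = [[score for _, score in frames[0]]]
    back = []  # back[t-1][j]: predecessor index in frame t-1 for detection j of frame t
    for t in range(1, len(frames)):
        row, ptr = [], []
        for box_j, score_j in frames[t]:
            cands = [best[t - 1][i] + overlap_weight * iou(box_i, box_j)
                     for i, (box_i, _) in enumerate(frames[t - 1])]
            i_best = max(range(len(cands)), key=cands.__getitem__)
            row.append(cands[i_best] + score_j)
            ptr.append(i_best)
        best.append(row)
        back.append(ptr)
    # Backtrack from the best final detection.
    j = max(range(len(best[-1])), key=best[-1].__getitem__)
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return path


def max_subarray(scores):
    """Kadane's algorithm: return (start, end, total) of the contiguous run of
    frames with the largest score sum; `end` is inclusive."""
    if not scores:
        raise ValueError("scores must be non-empty")
    best_sum = cur_sum = scores[0]
    best_start = best_end = cur_start = 0
    for i in range(1, len(scores)):
        if cur_sum < 0:
            # A negative running sum can only hurt; restart the run here.
            cur_sum, cur_start = scores[i], i
        else:
            cur_sum += scores[i]
        if cur_sum > best_sum:
            best_sum, best_start, best_end = cur_sum, cur_start, i
    return best_start, best_end, best_sum
```

In this sketch, `link_detections` would be run once per action class to build a tube through the video, and `max_subarray` would then be applied to the offset scores along that tube to trim it to the frames where the action occurs.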
Task | Dataset | Model | Metric Name | Metric Value | Global Rank | Benchmark |
---|---|---|---|---|---|---|
Skeleton Based Action Recognition | J-HMDB | MR Two-Stream R-CNN | Accuracy (RGB+pose) | 71.1 | # 7 | |
Action Recognition | UCF101 | MR Two-Stream R-CNN | 3-fold Accuracy | 91.1 | # 68 | |
Action Detection | UCF101-24 | TS R-CNN | Frame-mAP 0.5 | 39.94 | # 11 | |
Action Detection | UCF101-24 | MR-TS R-CNN | Frame-mAP 0.5 | 39.63 | # 12 | |
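Action Detection | UCF Sports | TS R-CNN | Video-mAP 0.2 | 94.82 | # 2 | |
Action Detection | UCF Sports | TS R-CNN | Video-mAP 0.5 | 94.82 | # 2 | |
Action Detection | UCF Sports | TS R-CNN | Frame-mAP 0.5 | 82.30 | # 3 | |
Action Detection | UCF Sports | MR-TS R-CNN | Video-mAP 0.2 | 94.83 | # 1 | |
Action Detection | UCF Sports | MR-TS R-CNN | Video-mAP 0.5 | 94.67 | # 3 | |
Action Detection | UCF Sports | MR-TS R-CNN | Frame-mAP 0.5 | 84.52 | # 2 | |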