Cooperative Cross-Stream Network for Discriminative Action Representation

27 Aug 2019  ·  Jingran Zhang, Fumin Shen, Xing Xu, Heng Tao Shen

Two-stream models that pair a spatial stream with a temporal stream have achieved great success in video action recognition. Most existing work focuses on designing effective feature fusion methods and trains the two streams separately, which makes it hard to ensure discriminability and to exploit the complementary information between streams. In this work, we propose a novel cooperative cross-stream network that investigates the conjoint information in multiple modalities. Feature extraction for the spatial and temporal stream networks is performed jointly through end-to-end learning. The complementary information of the different modalities is extracted by a connection block, which aims at exploring correlations between the features of the different streams. Furthermore, unlike a conventional ConvNet that learns deep separable features with only a cross-entropy loss, our proposed model enhances the discriminative power of the deeply learned features and reduces the undesired modality discrepancy by jointly optimizing a modality ranking constraint and a cross-entropy loss for both homogeneous and heterogeneous modalities. The modality ranking constraint consists of an intra-modality discriminative embedding and an inter-modality triplet constraint, and it reduces both intra-modality and cross-modality feature variations. Experiments on three benchmark datasets demonstrate that, by making appearance and motion feature extraction cooperate, our method achieves state-of-the-art or competitive performance compared with existing results.
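To make the joint objective described in the abstract more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a squared-Euclidean triplet hinge for both the intra-modality and inter-modality terms and a simple weighted sum with the cross-entropy loss; the abstract does not specify the exact formulation. The function names, the margin, and the ranking_weight parameter are all hypothetical.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_hinge(anchor, positive, negative, margin=1.0):
    """Hinge triplet term: zero once the positive is closer than the
    negative by at least `margin` (assumed form, not taken from the paper)."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, using a stable log-sum-exp."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[label]


def joint_objective(logits_per_stream, label, intra_triplets, inter_triplets,
                    ranking_weight=1.0, margin=1.0):
    """Cross-entropy over every stream plus a weighted ranking term.

    intra_triplets: (anchor, positive, negative) drawn from one modality.
    inter_triplets: anchor from one modality, positive and negative from
    the other. The weighting scheme is an assumption for illustration.
    """
    ce = sum(cross_entropy(logits, label) for logits in logits_per_stream)
    intra = sum(triplet_hinge(a, p, n, margin) for a, p, n in intra_triplets)
    inter = sum(triplet_hinge(a, p, n, margin) for a, p, n in inter_triplets)
    return ce + ranking_weight * (intra + inter)


# Toy 2-D embeddings: same-class features are close, different-class far.
rgb_anchor, rgb_pos, rgb_neg = [1.0, 0.0], [0.9, 0.1], [0.0, 1.0]
flow_pos, flow_neg = [0.8, 0.2], [0.1, 0.9]

total = joint_objective(
    logits_per_stream=[[2.0, 0.0], [2.0, 0.0]],  # spatial and temporal logits
    label=0,
    intra_triplets=[(rgb_anchor, rgb_pos, rgb_neg)],
    inter_triplets=[(rgb_anchor, flow_pos, flow_neg)],
)
```

In this toy setting both triplets already satisfy the margin, so only the cross-entropy terms contribute to total. A cross-modal triplet whose negative sits closer to the anchor than the positive would add a penalty, which is the sense in which such a term could reduce modality discrepancy.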

Results from the Paper


Ranked #15 on Action Recognition on HMDB-51 (using extra training data)

Task               | Dataset                | Model                                    | Metric Name                  | Metric Value | Global Rank
Action Recognition | HMDB-51                | CCS + TSN (ImageNet+Kinetics pretrained) | Average accuracy of 3 splits | 81.9         | # 15
Action Recognition | Something-Something V2 | CCS + two-stream + TRN                   | Top-1 Accuracy               | 61.2         | # 110
Action Recognition | Something-Something V2 | CCS + two-stream + TRN                   | Top-5 Accuracy               | 89.3         | # 71
Action Recognition | UCF101                 | CCS + TSN (ImageNet+Kinetics pretrained) | 3-fold Accuracy              | 97.4         | # 17
