AE-Net: Adjoint Enhancement Network for Efficient Action Recognition in Video Understanding

Action recognition in video understanding is a challenging task, largely because of the difficulty of temporal modeling, which causes existing methods to suffer from motion information loss and misalignment of temporal attention along the spatial dimension. To overcome these difficulties, we propose a novel temporal modeling method called the Adjoint Enhancement Network (AE-Net), which fully exploits motion and temporal cues in long-range structures. AE-Net consists of two new modules: the Initial Adjoint Enhancement Module (IAE-Module), which operates on shallow features, and the Global Adjoint Enhancement Module (GAE-Module), which operates on global features. Through a novel mechanism of parallel spatio-temporal convolution and difference fusion, the IAE-Module strengthens the motion transformations encoded in shallow network features, exciting the latent motion flow and avoiding motion information loss. The GAE-Module improves the local temporal representation in long-range structures by feeding enhanced feature differences into a spatial cascade module with residuals, resolving the misalignment of temporal attention in the spatial dimension. Experimental results show that AE-Net achieves state-of-the-art results on the Something-Something V1, UCF101, and HMDB-51 datasets.
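
Since the abstract describes the two modules only at a high level, the following minimal PyTorch sketch illustrates one plausible reading of the design. The class names (IAEModule, GAEModule), tensor layout, kernel sizes, depthwise convolutions, and sigmoid gating are all our assumptions for illustration; the paper does not provide code here, and the actual architecture may differ.

```python
# Minimal sketch of the two modules as described in the abstract.
# ASSUMPTIONS: class names, kernel sizes, depthwise convolutions, and the
# sigmoid gating are illustrative choices, not the paper's exact design.
import torch
import torch.nn as nn


def frame_difference(x: torch.Tensor) -> torch.Tensor:
    """Forward frame difference along the temporal axis of (N, C, T, H, W);
    the first frame, which has no predecessor, is zero-padded."""
    zeros = torch.zeros_like(x[:, :, :1])
    return torch.cat([zeros, x[:, :, 1:] - x[:, :, :-1]], dim=2)


class IAEModule(nn.Module):
    """Initial Adjoint Enhancement: parallel spatial and temporal
    convolutions fused with frame differences on shallow features."""

    def __init__(self, channels: int):
        super().__init__()
        # Parallel spatio-temporal branches (depthwise for efficiency).
        self.spatial = nn.Conv3d(channels, channels, (1, 3, 3),
                                 padding=(0, 1, 1), groups=channels)
        self.temporal = nn.Conv3d(channels, channels, (3, 1, 1),
                                  padding=(1, 0, 0), groups=channels)
        self.fuse = nn.Conv3d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Difference fusion excites motion cues in the shallow features.
        enhanced = self.spatial(x) + self.temporal(frame_difference(x))
        return x + self.fuse(enhanced)  # residual enhancement


class GAEModule(nn.Module):
    """Global Adjoint Enhancement: enhanced feature differences drive a
    residual spatial cascade that re-weights the input, aligning temporal
    attention with the spatial dimension."""

    def __init__(self, channels: int, depth: int = 2):
        super().__init__()
        self.cascade = nn.ModuleList(
            nn.Conv3d(channels, channels, (1, 3, 3), padding=(0, 1, 1))
            for _ in range(depth))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = frame_difference(x)
        for conv in self.cascade:
            attn = attn + self.act(conv(attn))  # spatial cascade w/ residuals
        return x + x * torch.sigmoid(attn)  # gated, identity-preserving


if __name__ == "__main__":
    clip = torch.randn(2, 64, 8, 56, 56)  # (batch, channels, frames, H, W)
    out = GAEModule(64)(IAEModule(64)(clip))
    print(out.shape)  # torch.Size([2, 64, 8, 56, 56])
```

Both modules are residual, so they can in principle be inserted into an existing 2D/3D backbone without changing its output shapes, which matches the abstract's framing of enhancing shallow and global features in place.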

Task               | Dataset                | Model                | Metric         | Value | Global Rank
-------------------|------------------------|----------------------|----------------|-------|------------
Action Recognition | Something-Something V1 | AE-Net (8+16 frames) | Top 1 Accuracy | 55.0  | #28
