Global Temporal Difference Network for Action Recognition

TMM 2022 · Zhao Xie, Jiansong Chen, Kewei Wu, Dan Guo, Richang Hong

Temporal modeling remains a challenge for action recognition. Most existing temporal models focus on learning local variation between neighboring frames, yet there are obvious deviations between local and global variations, such as subtle versus notable motion variations. In this paper, we propose a global temporal difference module for action recognition, which consists of two sub-modules: a global aggregation module and a global difference module. The two sub-modules cooperate following the idea of using prior knowledge from the global view (i.e., global motion variation) to guide local learning at each moment. In the global aggregation module, the global prior knowledge is learned by aggregating the visual feature sequence of a video into a global vector. In the global difference module, we build the difference vector sequence of the video by subtracting each local vector from the global vector, so that the method acts as contextual guidance with a global view. The sequential dependency between these difference vectors is exploited with a channel-wise self-attention operation. Finally, the difference vector at each timestamp is used to enhance the semantics of the original local features, so that the enhanced features deviate less from the global variation of the video. We instantiate the global temporal difference module into the ResNet block to form a global temporal difference network (GTDNet). Extensive experiments show that our method achieves competitive performance at small FLOPs on Something-Something V1 & V2 and Kinetics-400.
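The following is a minimal PyTorch sketch of the global temporal difference idea as described in the abstract, not the authors' released implementation. The module name, the (N, T, C) tensor layout, mean pooling as the aggregation step, and the squeeze-and-excitation-style gate standing in for the channel-wise self-attention are all assumptions for illustration.

```python
# Hypothetical sketch of a global temporal difference block.
# Aggregation choice (mean pooling) and the attention form are assumptions.
import torch
import torch.nn as nn


class GlobalTemporalDifference(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Channel-wise attention over the difference vectors
        # (an SE-style gate assumed here, not the paper's exact operation).
        self.attn = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, T, C) -- per-frame feature vectors of a video clip.
        # Global aggregation: pool the sequence into one global vector.
        g = x.mean(dim=1, keepdim=True)   # (N, 1, C)
        # Global difference: subtract each local vector from the global one.
        d = g - x                         # (N, T, C)
        # Channel-wise attention over the difference sequence.
        w = self.attn(d)                  # (N, T, C)
        # Enhance the original local features with the weighted differences.
        return x + w * d


if __name__ == "__main__":
    clip = torch.randn(2, 8, 64)          # 2 videos, 8 frames, 64 channels
    module = GlobalTemporalDifference(channels=64)
    print(module(clip).shape)             # torch.Size([2, 8, 64])
```

In the paper this module is inserted into a ResNet block to form GTDNet; the sketch above only covers the aggregation, difference, and enhancement steps on a standalone feature sequence.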

Task               | Dataset                | Model  | Metric         | Value | Global Rank
Action Recognition | Something-Something V2 | GTDNet | Top-1 Accuracy | 67.6  | #62
