Existing approaches usually learn action representations through sequential prediction, but they fail to fully capture semantic information.
We introduce Multi-Temporal networks that model spatio-temporal patterns of different temporal durations at each layer.
In this paper, we explore various methods for embedding the power of an ensemble into a single model.
Ranked #1 on Action Recognition on Something-Something V2
Experiments on three publicly available multimodal HAR datasets demonstrate that the proposed MGAF outperforms previous state-of-the-art fusion methods for depth-inertial HAR in recognition accuracy while being far more computationally efficient.
In recent years, a number of approaches based on 2D CNNs and 3D CNNs have emerged for video action recognition, achieving state-of-the-art results on several large-scale benchmark datasets.
However, state-of-the-art (SOTA) models for this task tend to be exceedingly complex and over-parameterized; their low efficiency in training and inference has hindered progress in the field, especially on large-scale action datasets.
Ranked #1 on Skeleton Based Action Recognition on NTU RGB+D 120