In this paper, we empirically find that stacking more conventional temporal convolution layers actually deteriorates action classification performance, possibly because all channels of the 1D feature map, which are generally highly abstract and can be regarded as latent concepts, are excessively recombined in temporal convolution.
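To make the channel-recombination point concrete, the sketch below contrasts a conventional temporal convolution, where every output channel sums over all input channels, with a channel-wise one that filters each channel independently. It is a minimal pure-Python illustration, not the paper's architecture: the function names, the zero padding, and the choice of a channel-wise variant as the contrast case are all assumptions made for this example.

```python
def temporal_conv(x, w):
    """Conventional 1D temporal convolution (illustrative sketch).

    x: input of shape [C_in][T]; w: weights of shape [C_out][C_in][K], K odd.
    Zero padding keeps the temporal length T. Every output channel mixes
    ALL input channels, which is the recombination the abstract refers to.
    """
    c_in, t_len = len(x), len(x[0])
    k = len(w[0][0])
    pad = k // 2
    out = []
    for w_o in w:  # one weight tensor per output channel
        row = []
        for t in range(t_len):
            acc = 0.0
            for c in range(c_in):  # sum over every input channel
                for j in range(k):
                    s = t + j - pad
                    if 0 <= s < t_len:
                        acc += w_o[c][j] * x[c][s]
            row.append(acc)
        out.append(row)
    return out


def channelwise_temporal_conv(x, w):
    """Channel-wise temporal convolution (hypothetical contrast case).

    x: input of shape [C][T]; w: weights of shape [C][K], K odd.
    Each channel is filtered along time on its own, so channels never mix.
    """
    t_len = len(x[0])
    out = []
    for x_c, w_c in zip(x, w):
        pad = len(w_c) // 2
        row = []
        for t in range(t_len):
            acc = 0.0
            for j, w_j in enumerate(w_c):
                s = t + j - pad
                if 0 <= s < t_len:
                    acc += w_j * x_c[s]
            row.append(acc)
        out.append(row)
    return out
```

With identity-like kernels, the conventional version adds the two channels together while the channel-wise version returns them unchanged, which shows where the mixing happens.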
The paucity of videos in current action classification datasets (UCF-101 and HMDB-51) has made it difficult to identify good video architectures, as most methods obtain similar performance on existing small-scale benchmarks.
In this paper, we introduce a network architecture that takes long-term content into account and enables fast per-video processing at the same time.
Ranked #21 on Action Recognition on Something-Something V1 (using extra training data)
We then apply GCNs over the graph to model the relations among different proposals and learn powerful representations for action classification and localization.
Ranked #1 on Temporal Action Localization on THUMOS’14
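As a rough illustration of applying a GCN over a proposal graph, the sketch below implements one generic graph-convolution layer: each proposal's feature vector is averaged with those of its neighbours, projected by a weight matrix, and passed through a ReLU. The row normalization with self-loops, the function names, and the toy inputs are assumptions; the excerpt does not say how the paper builds its edges or normalizes the adjacency matrix.

```python
def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def gcn_layer(features, adjacency, weight):
    """One generic graph-convolution layer: ReLU(A_norm @ H @ W).

    features:  [N][D_in]  one feature vector per proposal (node)
    adjacency: [N][N]     non-negative edge weights between proposals
    weight:    [D_in][D_out] projection matrix (learned in practice)
    """
    n = len(features)
    # Add self-loops so each proposal keeps its own features.
    a_hat = [
        [adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    # Row-normalize so aggregation is a weighted average over neighbours.
    a_norm = [[v / sum(row) for v in row] for row in a_hat]
    aggregated = matmul(a_norm, features)
    projected = matmul(aggregated, weight)
    return [[max(0.0, v) for v in row] for row in projected]
```

For two connected proposals with features 2.0 and 4.0 and an identity weight, both end up with the averaged value 3.0, while a proposal with no neighbours keeps its own feature.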
To address this challenging issue, we exploit the effectiveness of deep networks in temporal action localization via three segment-based 3D ConvNets: (1) a proposal network identifies candidate segments in a long video that may contain actions; (2) a classification network learns a one-vs-all action classification model to serve as initialization for the localization network; and (3) a localization network, fine-tuned from the learned classification network, localizes each action instance.
Ranked #1 on Temporal Action Localization on MEXaction2
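The inference-time flow of such a multi-stage pipeline can be sketched as follows: candidate segments are filtered by a proposal score, and the survivors are labelled and ranked by a localization score. This is only a structural sketch. The callables `proposal_score` and `localization_score` are hypothetical stand-ins for the trained networks, the threshold value is arbitrary, and the classification network is not represented because, per the description above, it serves to initialize the localization network during training.

```python
def localize_actions(segments, proposal_score, localization_score,
                     proposal_threshold=0.5):
    """Two-step inference over candidate segments (illustrative sketch).

    segments:           list of (start, end) tuples, e.g. in frames
    proposal_score:     hypothetical callable, segment -> float in [0, 1]
    localization_score: hypothetical callable, segment -> (label, confidence)
    proposal_threshold: assumed cut-off for keeping a candidate
    """
    # Stage 1: keep only the segments likely to contain an action.
    candidates = [
        seg for seg in segments if proposal_score(seg) >= proposal_threshold
    ]
    # Stage 3: label each surviving segment and record its confidence.
    detections = []
    for seg in candidates:
        label, confidence = localization_score(seg)
        detections.append((seg, label, confidence))
    # Return the most confident detections first.
    detections.sort(key=lambda d: d[2], reverse=True)
    return detections
```

A real system would also merge or suppress overlapping detections; that step is omitted here because the excerpt does not describe it.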
This work introduces pyramidal convolution (PyConv), which is capable of processing the input at multiple filter scales.
Ranked #6 on Semantic Segmentation on ADE20K
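The idea of processing one input at several filter scales can be illustrated with a simplified 1D analogue: the same signal is convolved with kernels of different sizes in parallel, and the results are stacked as separate output channels. This is a sketch under stated assumptions, not PyConv itself. The function names are invented for this example, the input is 1D rather than an image, and any per-level grouping or channel allocation used by the actual method is left out.

```python
def conv1d_same(signal, kernel):
    """1D convolution with zero padding so the output length equals the input."""
    pad = len(kernel) // 2
    n = len(signal)
    out = []
    for t in range(n):
        acc = 0.0
        for j, w_j in enumerate(kernel):
            s = t + j - pad
            if 0 <= s < n:
                acc += w_j * signal[s]
        out.append(acc)
    return out


def pyramidal_conv1d(signal, kernels):
    """Apply kernels of several (odd) sizes to the same input in parallel.

    Returns one output row per pyramid level; stacking the rows plays the
    role of concatenating the levels along the channel axis.
    """
    for kernel in kernels:
        if len(kernel) % 2 == 0:
            raise ValueError("kernel sizes must be odd to keep the length")
    return [conv1d_same(signal, kernel) for kernel in kernels]
```

Calling `pyramidal_conv1d(signal, [k1, k3, k5])` with kernels of sizes 1, 3, and 5 gives three views of the same signal with progressively larger receptive fields.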
Our architecture is trained and evaluated on the standard video action benchmarks UCF-101 and HMDB-51, where it is competitive with the state of the art.
Ranked #1 on Action Classification on HMDB51