Late Temporal Modeling in 3D CNN Architectures with BERT for Action Recognition

In this work, we combine 3D convolution with late temporal modeling for action recognition. To this end, we replace the conventional Temporal Global Average Pooling (TGAP) layer at the end of the 3D convolutional architecture with a Bidirectional Encoder Representations from Transformers (BERT) layer, in order to better utilize temporal information through BERT's attention mechanism...
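The difference between TGAP and attention-based temporal pooling can be illustrated with a small sketch. This is not the paper's implementation: a full BERT layer adds multi-head self-attention, positional encodings, and a feed-forward sublayer, all of which are omitted here. The function names `tgap` and `attention_pool`, the `query` vector (standing in for a learned parameter), and the toy feature values are assumptions made purely for illustration.

```python
import math


def tgap(features):
    """Temporal global average pooling: mean of the per-time-step feature vectors."""
    steps = len(features)
    dim = len(features[0])
    return [sum(f[d] for f in features) / steps for d in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features, query):
    """Single-head scaled dot-product attention over time.

    `query` is a hypothetical stand-in for a learned vector; each time step
    is weighted by its similarity to the query instead of uniformly.
    """
    dim = len(query)
    scores = [
        sum(q * x for q, x in zip(query, f)) / math.sqrt(dim) for f in features
    ]
    weights = softmax(scores)
    pooled = [
        sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)
    ]
    return pooled, weights


# Toy input: 4 time steps, 3 feature channels (made-up values).
features = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]

uniform_pooled, uniform_weights = attention_pool(features, [0.0, 0.0, 0.0])
focused_pooled, focused_weights = attention_pool(features, [5.0, 0.0, 0.0])
```

With an all-zero query every time step receives the same weight, so the attention output coincides with `tgap(features)`. A non-zero query shifts weight toward the time steps that match it, which is the kind of flexibility that plain averaging lacks.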


Results from the Paper


 Ranked #1 on Action Recognition on HMDB-51 (using extra training data)

Task                Dataset  Model       Metric name                   Metric value  Global rank
Action Recognition  HMDB-51  R2+1D-BERT  Average accuracy of 3 splits  85.10         #1
Action Recognition  UCF101   R2+1D-BERT  3-fold Accuracy               98.69         #1

Methods used in the Paper