# Late Temporal Modeling in 3D CNN Architectures with BERT for Action Recognition

3 Aug 2020

In this work, we combine 3D convolution with late temporal modeling for action recognition. To this end, we replace the conventional Temporal Global Average Pooling (TGAP) layer at the end of the 3D convolutional architecture with a Bidirectional Encoder Representations from Transformers (BERT) layer, so that BERT's attention mechanism can make better use of the temporal information.
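The difference between TGAP and attention-based late temporal modeling can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it uses a single self-attention head rather than a full BERT layer, and the classification token, the positional embeddings, the weight matrices, and the sizes `T` and `D` are all assumptions chosen for illustration, with random values standing in for learned parameters and for backbone features.

```python
import numpy as np


def tgap(features):
    """Temporal Global Average Pooling: mean over the time axis.

    features: (T, D) array, one D-dimensional vector per temporal position.
    """
    return features.mean(axis=0)


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attention_pool(features, cls_token, pos_emb, w_q, w_k, w_v):
    """Single-head self-attention over [CLS] + T temporal features.

    Returns the output at the classification token and the attention
    weights that token assigns to every position (assumed design).
    """
    # Prepend the classification token and add positional embeddings.
    tokens = np.vstack([cls_token[None, :], features]) + pos_emb  # (T+1, D)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    # Scaled dot-product attention.
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (T+1, T+1)
    weights = softmax(scores, axis=-1)
    out = weights @ v  # (T+1, D)
    return out[0], weights[0]


# Hypothetical sizes: T temporal positions, D feature channels.
rng = np.random.default_rng(0)
T, D = 8, 16
features = rng.standard_normal((T, D))  # stand-in for 3D CNN output
cls_token = rng.standard_normal(D) * 0.02
pos_emb = rng.standard_normal((T + 1, D)) * 0.02
w_q, w_k, w_v = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))

pooled_avg = tgap(features)
pooled_att, cls_weights = attention_pool(
    features, cls_token, pos_emb, w_q, w_k, w_v
)
```

TGAP weights every temporal position equally and ignores their order, whereas the attention-based pooling computes input-dependent weights (`cls_weights`) and, through the positional embeddings, can take temporal order into account.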


# Results from the Paper

Ranked #1 on Action Recognition on HMDB-51 (using extra training data)

| Task | Dataset | Model | Metric name | Metric value | Global rank |
|---|---|---|---|---|---|
| Action Recognition | HMDB-51 | R2+1D-BERT | Average accuracy of 3 splits | 85.10 | # 1 |
| Action Recognition | UCF101 | R2+1D-BERT | 3-fold Accuracy | 98.69 | # 1 |