Cross-Enhancement Transformer for Action Segmentation

19 May 2022  ·  Jiahui Wang, Zhenyou Wang, Shanna Zhuang, Hui Wang ·

Temporal convolutions have been the paradigm of choice in action segmentation, enlarging the long-term receptive field by stacking convolution layers. However, deep stacks lose the local information necessary for frame recognition. To address this problem, this paper proposes a novel encoder-decoder structure called the Cross-Enhancement Transformer. Our approach learns an effective temporal structure representation with an interactive self-attention mechanism: the convolutional feature maps from each encoder layer are concatenated with a set of decoder features produced via self-attention, so that local and global information are used simultaneously across a series of frame actions. In addition, a new loss function is proposed to enhance training by penalizing over-segmentation errors. Experiments show that our framework achieves state-of-the-art performance on three challenging datasets: 50Salads, Georgia Tech Egocentric Activities (GTEA), and the Breakfast dataset.
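The core idea above, fusing a convolution layer's local frame features with globally attended features by concatenation, can be sketched as follows. This is a minimal illustrative sketch in NumPy, not the paper's implementation; the function names, shapes, and single-head attention are assumptions for clarity.

```python
import numpy as np

def temporal_conv1d(x, w):
    """Same-padded 1D temporal convolution over frames (local context).
    x: (T, D) frame features; w: (k, D, D) kernel (k taps, D->D channels)."""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    T, D = x.shape
    out = np.zeros((T, D))
    for t in range(T):
        for i in range(k):
            out[t] += xp[t + i] @ w[i]
    return np.maximum(out, 0.0)  # ReLU

def self_attention(x):
    """Scaled dot-product self-attention over all frames (global context)."""
    d = x.shape[1]
    scores = x @ x.T / np.sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)      # softmax rows sum to 1
    return attn @ x

rng = np.random.default_rng(0)
T, D, k = 8, 4, 3                       # 8 frames, 4-dim features, kernel 3
x = rng.standard_normal((T, D))
w = rng.standard_normal((k, D, D)) * 0.1

local_feats = temporal_conv1d(x, w)     # local temporal patterns (encoder side)
global_feats = self_attention(x)        # long-range dependencies (decoder side)
fused = np.concatenate([local_feats, global_feats], axis=1)  # (T, 2D)
print(fused.shape)
```

Each frame in `fused` now carries both its local convolutional context and a globally attended summary, which is the combination the cross-enhancement design exploits.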


Results from the Paper


Task: Action Segmentation · Model: CETNet (metric value, global rank)

50 Salads:  F1@10% 87.6 (#10) · F1@25% 86.5 (#9) · F1@50% 80.1 (#9) · Edit 81.7 (#11) · Acc 86.9 (#10)
Breakfast:  F1@10% 79.3 (#4) · F1@25% 74.3 (#4) · F1@50% 61.9 (#5) · Edit 77.8 (#5) · Acc 74.9 (#8)
GTEA:       F1@10% 91.8 (#8) · F1@25% 91.2 (#7) · F1@50% 81.3 (#7) · Edit 87.9 (#8) · Acc 80.3 (#8)