Leveraging triplet loss for unsupervised action segmentation

13 Apr 2023 · E. Bueno-Benito, B. Tura, M. Dimiccoli

In this paper, we propose a novel fully unsupervised framework that learns action representations suitable for the action segmentation task from the single input video itself, without requiring any training data. Our method is a deep metric learning approach rooted in a shallow network with a triplet loss operating on similarity distributions and a novel triplet selection strategy that effectively models temporal and semantic priors to discover actions in the new representational space. Under these circumstances, we successfully recover temporal boundaries in the learned action representations with higher quality compared with existing unsupervised approaches. The proposed method is evaluated on two widely used benchmark datasets for the action segmentation task and it achieves competitive performance by applying a generic clustering algorithm on the learned representations.
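
To make the triplet idea in the abstract concrete, here is a minimal sketch of a generic triplet margin loss combined with a purely temporal triplet selection rule. It is not the authors' method: the paper's loss operates on similarity distributions and its selection strategy also uses semantic priors, neither of which is reproduced here. The function names, the `near`/`far` offsets, the margin, and the toy features are all assumptions made for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Textbook triplet margin loss: max(0, d(a, p) - d(a, n) + margin).
    Assumption: plain distances, not the similarity distributions used in the paper."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

def temporal_triplets(num_frames, near=1, far=5):
    """Hypothetical temporal prior: the positive is a frame `near` steps from the
    anchor, the negative a frame `far` steps away (mirrored at the video end)."""
    triplets = []
    for a in range(num_frames):
        p = a + near if a + near < num_frames else a - near
        n = a + far if a + far < num_frames else a - far
        if 0 <= p < num_frames and 0 <= n < num_frames:
            triplets.append((a, p, n))
    return triplets

# Toy video: 5 frames of one "action" followed by 5 frames of another (made-up features).
frames = [[0.0, 0.0]] * 5 + [[3.0, 4.0]] * 5
triplets = temporal_triplets(len(frames))
losses = [triplet_loss(frames[a], frames[p], frames[n]) for a, p, n in triplets]
mean_loss = sum(losses) / len(losses)
print(f"{len(triplets)} triplets, mean loss {mean_loss:.2f}")
```

In this toy setup the only triplet with a non-zero loss is the one whose anchor sits right at the action boundary, which hints at why a temporal prior alone is not enough and why the paper pairs it with a semantic one. In a full pipeline, a loss of this kind would train the embedding network, and a generic clustering algorithm (K-means, spectral clustering, or FINCH in the results listed for this paper) would then be applied to the learned representations.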

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Action Segmentation | Breakfast | TSA (FINCH) | Acc | 65.1 | # 27 |
| | | | mIoU | 52.1 | # 3 |
| Action Segmentation | Breakfast | TSA (Spectral) | Acc | 63.2 | # 30 |
| | | | mIoU | 52.7 | # 2 |
| | | | F1 | 57.8 | # 2 |
| Action Segmentation | Breakfast | TSA (Kmeans) | Acc | 63.7 | # 29 |
| | | | mIoU | 53.3 | # 1 |
| | | | F1 | 58 | # 1 |
| Action Segmentation | Youtube INRIA Instructional | TSA (FINCH) | F1 | 54.7 | # 2 |
| | | | Acc | 62.4 | # 1 |
| Action Segmentation | Youtube INRIA Instructional | TSA (Kmeans) | F1 | 55.3 | # 1 |
| | | | Acc | 59.7 | # 2 |

Methods