Semi-Weakly-Supervised Learning of Complex Actions From Instructional Task Videos

CVPR 2022 · YuHan Shen, Ehsan Elhamifar

We address the problem of action segmentation in instructional task videos given a small number of weakly-labeled training videos and a large number of unlabeled videos, a setting we refer to as Semi-Weakly-Supervised Learning (SWSL) of actions. We propose a general SWSL framework that learns efficiently from both types of videos and can leverage any existing weakly-supervised action segmentation method. Our key observation is that the distance between the transcript of an unlabeled video and the transcripts of weakly-labeled videos from the same task is small yet often nonzero. We therefore develop a Soft Restricted Edit (SRE) loss that encourages small variations between the predicted transcripts of unlabeled videos and the ground-truth transcripts of weakly-labeled videos of the same task. To compute the SRE loss, we develop a flexible transcript prediction (FTP) method that uses the output of the action classifier to find both the length of the transcript and the sequence of actions occurring in an unlabeled video. We propose an efficient learning scheme that alternates between minimizing the proposed loss and generating pseudo-transcripts for unlabeled videos. Through experiments on two benchmark datasets, we demonstrate that our approach significantly improves performance by using unlabeled videos, especially when the number of weakly-labeled videos is small.
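To make the key observation concrete, the following is a minimal sketch that measures how far an unlabeled video's transcript is from the weakly-labeled transcripts of the same task. It uses the standard Levenshtein edit distance as a stand-in; it is not the SRE loss itself, whose exact formulation is not given in this abstract. The function names (`edit_distance`, `min_distance_to_task`) and the toy action labels are assumptions made purely for illustration.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Standard Levenshtein distance between two action transcripts.

    Illustrative stand-in only: this is NOT the paper's Soft Restricted
    Edit loss, just a plain (hard, non-differentiable) edit distance.
    """
    # prev[j] holds the distance between the processed prefix of `a` and b[:j].
    prev = list(range(len(b) + 1))
    for i, action_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty transcript
        for j, action_b in enumerate(b, start=1):
            substitution_cost = 0 if action_a == action_b else 1
            curr.append(
                min(
                    prev[j] + 1,                      # delete action_a
                    curr[j - 1] + 1,                  # insert action_b
                    prev[j - 1] + substitution_cost,  # match or substitute
                )
            )
        prev = curr
    return prev[-1]


def min_distance_to_task(
    predicted: Sequence[str], task_transcripts: List[Sequence[str]]
) -> int:
    """Smallest edit distance from a predicted transcript to any
    weakly-labeled transcript of the same task (hypothetical helper)."""
    return min(edit_distance(predicted, t) for t in task_transcripts)


# Hypothetical toy transcripts for a single task.
labeled_transcripts = [
    ["take_cup", "pour_coffee", "pour_milk", "stir"],
    ["take_cup", "pour_coffee", "add_sugar", "stir"],
]
unlabeled_prediction = ["take_cup", "pour_coffee", "stir"]

print(min_distance_to_task(unlabeled_prediction, labeled_transcripts))
```

In this toy case the predicted transcript differs from each labeled one by a single missing action, which is the "small yet often nonzero" situation the abstract describes. A training loss built on this idea would need a differentiable relaxation of the distance, which this sketch does not attempt.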
