Set-Supervised Action Learning in Procedural Task Videos via Pairwise Order Consistency

CVPR 2022  ·  Zijia Lu, Ehsan Elhamifar ·

We address the problem of set-supervised action learning, whose goal is to learn an action segmentation model using weak supervision in the form of sets of actions occurring in training videos. Our key observation is that videos within the same task have similar ordering of actions, which can be leveraged for effective learning. Therefore, we propose an attention-based method with a new Pairwise Ordering Consistency (POC) loss that encourages that for each common action pair in two videos of the same task, the attentions of actions follow a similar ordering. Unlike existing sequence alignment methods, which misalign actions in videos with different orderings or cannot reliably separate more from less consistent orderings, our POC loss efficiently aligns videos with different action orders and is differentiable, which enables end-to-end training. In addition, it avoids the time-consuming pseudo-label generation of prior works. Our method efficiently learns the actions and their temporal locations, therefore, extends the existing attention-based action localization methods from learning one action per video to multiple actions using our POC loss along with video-level and frame-level losses. By experiments on three datasets, we demonstrate that our method significantly improves the state of the art. We also show that our method, with a small modification, can effectively address the transcript-supervised action learning task, where actions and their ordering are available during training.

PDF Abstract
Task Dataset Model Metric Name Metric Value Global Rank Benchmark
Weakly Supervised Action Segmentation (Action Set)) Breakfast POC Acc 42.4 # 2

Methods


No methods listed for this paper. Add relevant methods here