AdaFocus: Towards End-to-end Weakly Supervised Learning for Long-Video Action Understanding

28 Nov 2023 · Jiaming Zhou, Hanjun Li, Kun-Yu Lin, Junwei Liang

Developing end-to-end models for long-video action understanding tasks presents significant computational and memory challenges. Existing works generally build models on long-video features extracted by off-the-shelf action recognition models, which are trained on short-video datasets in different domains, so the extracted features suffer from domain discrepancy. To avoid this, action recognition models can be trained end-to-end on clips, which are trimmed from long videos and labeled using action interval annotations. Such fully supervised annotations are expensive to collect. Thus, a weakly supervised method is needed for long-video action understanding at scale. Under the weak supervision setting, action labels are provided for the whole video without precise start and end times of the action clip. To this end, we propose the AdaFocus framework. AdaFocus estimates the spike-actionness and temporal positions of actions, enabling it to adaptively focus on action clips that facilitate better training without the need for precise annotations. Experiments on three long-video datasets show its effectiveness. Remarkably, on two of the datasets, models trained with AdaFocus under weak supervision outperform those trained under full supervision. Furthermore, we form a weakly supervised feature extraction pipeline with AdaFocus, which enables significant improvements on three long-video action understanding tasks.

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Action Segmentation | Breakfast | AdaFocus (newly extracted I3D-features, LT-Context model) | F1@10% | 82.1 | # 1 |
| | | | F1@50% | 67.5 | # 1 |
| | | | Acc | 78.0 | # 1 |
| | | | Edit | 78.3 | # 4 |
| | | | F1@25% | 79.0 | # 1 |
| Long-video Activity Recognition | Breakfast | AdaFocus (I3D-Breakfast-Pretrain-feature, GHRM) | mAP | 69.6 | # 4 |
| Long-video Activity Recognition | Breakfast | AdaFocus (MViT-Breakfast-Pretrain-feature, Timeception) | mAP | 79.2 | # 2 |
| Long-video Activity Recognition | Breakfast | AdaFocus (I3D-Breakfast-Pretrain-feature, Timeception) | mAP | 70.4 | # 3 |
| Long-video Activity Recognition | Breakfast | AdaFocus (MViT-Breakfast-Pretrain-feature, GHRM) | mAP | 79.5 | # 1 |
| Weakly Supervised Action Segmentation (Action Set) | Breakfast | AdaFocus (newly extracted I3D-features, POC model) | Acc | 49.6 | # 1 |
| Action Classification | Charades | AdaFocus (weak supervision, MViT-B-24, 32x3) | MAP | 47.8 | # 13 |
| Action Classification | Charades | AdaFocus (weak supervision, Slowfast-R50, 16x8) | MAP | 39.3 | # 35 |
| Action Classification | Charades | AdaFocus (weak supervision, X3D-L, 32x3) | MAP | 41.2 | # 29 |
| Action Classification | Charades | AdaFocus (weak supervision, MViT-B-K400-pretrain, 16x4) | MAP | 41.4 | # 28 |
| Temporal Sentence Grounding | Charades-STA | AdaFocus (Full, I3D-Charades-Pretrain-feature, MMN model) | R1@0.5 | 56.7 | # 2 |
| | | | R1@0.7 | 35.6 | # 2 |
| | | | R5@0.7 | 65.0 | # 2 |
| | | | R5@0.5 | 87.9 | # 3 |
| Temporal Sentence Grounding | Charades-STA | AdaFocus (Weak, I3D-Charades-Pretrain-feature, CPL model) | R1@0.5 | 49.1 | # 7 |
| | | | R1@0.7 | 22.4 | # 6 |
| | | | R5@0.7 | 51.8 | # 7 |
| | | | R5@0.5 | 84.2 | # 8 |
| Temporal Sentence Grounding | Charades-STA | AdaFocus (Weak, MViT-Charades-Pretrain-feature, CPL model) | R1@0.5 | 51.7 | # 4 |
| | | | R1@0.7 | 23.2 | # 5 |
| | | | R5@0.7 | 52.6 | # 6 |
| | | | R5@0.5 | 85.2 | # 6 |
| Temporal Sentence Grounding | Charades-STA | AdaFocus (Full, MViT-Charades-Pretrain-feature, MMN model) | R1@0.5 | 62.4 | # 1 |
| | | | R1@0.7 | 38.6 | # 1 |
| | | | R5@0.7 | 66.4 | # 1 |
| | | | R5@0.5 | 89.4 | # 1 |
| Temporal Sentence Grounding | Charades-STA | AdaFocus (Semi-weak, MViT-Charades-Pretrain-feature, D3G model) | R1@0.5 | 50.1 | # 5 |
| | | | R1@0.7 | 21.8 | # 7 |
| | | | R5@0.7 | 54.6 | # 5 |
| | | | R5@0.5 | 86.1 | # 4 |
| Temporal Sentence Grounding | Charades-STA | AdaFocus (Semi-weak, I3D-Charades-Pretrain-feature, D3G model) | R1@0.5 | 46.9 | # 9 |
| | | | R1@0.7 | 21.1 | # 9 |
| | | | R5@0.7 | 49.2 | # 10 |
| | | | R5@0.5 | 79.3 | # 11 |
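
The Rn@m metrics in the Charades-STA rows (R1@0.5, R5@0.7, and so on) are conventionally read as "recall at top-n with temporal IoU of at least m": the percentage of sentence queries for which at least one of the n highest-ranked predicted segments overlaps the ground-truth segment with IoU >= m. The sketch below illustrates that convention only. The function names `temporal_iou` and `recall_at_n` and the toy segments are assumptions made for illustration and are not taken from the AdaFocus code or any benchmark toolkit.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec), hypothetical representation


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two 1-D time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_n(ranked_preds: List[List[Segment]],
                gts: List[Segment],
                n: int,
                iou_thresh: float) -> float:
    """Percentage of queries with a top-n prediction whose IoU >= iou_thresh.

    ranked_preds[i] holds the predicted segments for query i, best first.
    gts[i] is the single ground-truth segment for query i.
    """
    hits = sum(
        any(temporal_iou(p, gt) >= iou_thresh for p in preds[:n])
        for preds, gt in zip(ranked_preds, gts)
    )
    return 100.0 * hits / len(gts)


# Toy data: two queries, two ranked predictions each.
preds = [[(0.0, 10.0), (5.0, 15.0)], [(20.0, 30.0), (0.0, 4.0)]]
gts = [(4.0, 14.0), (0.0, 5.0)]

r1_at_05 = recall_at_n(preds, gts, n=1, iou_thresh=0.5)
r5_at_05 = recall_at_n(preds, gts, n=5, iou_thresh=0.5)
```

In this toy case neither top-1 prediction reaches IoU 0.5, while the second-ranked prediction does for both queries, which is why the top-5 recall is higher than the top-1 recall. The same gap between the R1 and R5 columns is visible in the table.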
