COMEDIAN: Self-Supervised Learning and Knowledge Distillation for Action Spotting using Transformers

3 Sep 2023 · Julien Denize, Mykola Liashuha, Jaonary Rabarisoa, Astrid Orcesi, Romain Hérault

We present COMEDIAN, a novel pipeline to initialize spatiotemporal transformers for action spotting, which involves self-supervised learning and knowledge distillation. Action spotting is a timestamp-level temporal action detection task. Our pipeline consists of three steps, with two initialization stages. First, we perform self-supervised initialization of a spatial transformer using short videos as input. Additionally, we initialize a temporal transformer that enhances the spatial transformer's outputs with global context through knowledge distillation from a pre-computed feature bank aligned with each short video segment. In the final step, we fine-tune the transformers to the action spotting task. The experiments, conducted on the SoccerNet-v2 dataset, demonstrate state-of-the-art performance and validate the effectiveness of COMEDIAN's pretraining paradigm. Our results highlight several advantages of our pretraining pipeline, including improved performance and faster convergence compared to non-pretrained models.
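
The knowledge-distillation stage described above can be illustrated with a minimal sketch. The abstract does not state the actual distillation objective, so the cosine-distance loss below is an assumption chosen for illustration, and the names `distillation_loss`, `student_tokens`, and `teacher_bank` are hypothetical. The sketch only shows the alignment idea: each output of the temporal transformer (student) is pulled toward the pre-computed feature-bank entry (teacher) for the same short video segment.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def distillation_loss(student_tokens, teacher_bank):
    """Mean cosine distance between aligned student and teacher features.

    student_tokens: one output vector per short video segment (student).
    teacher_bank:   one pre-computed feature per segment, in the same order.
    The choice of cosine distance is an assumption, not the paper's stated loss.
    """
    if len(student_tokens) != len(teacher_bank):
        raise ValueError("student and teacher must cover the same segments")
    distances = [
        1.0 - cosine_similarity(s, t)
        for s, t in zip(student_tokens, teacher_bank)
    ]
    return sum(distances) / len(distances)


# Toy example: three segments with 2-dimensional features.
student = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
teacher = [[2.0, 0.0], [0.0, 3.0], [1.0, -1.0]]
loss = distillation_loss(student, teacher)
```

In this toy input the first two segments point in the same direction as their teacher features and the third is orthogonal to it, so only the third segment contributes to the loss. In practice the features would be tensors and the loss would be minimized by gradient descent before the final fine-tuning step.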


Results from the Paper


Task             Dataset       Model                     Metric Name        Metric Value  Global Rank
Action Spotting  SoccerNet-v2  COMEDIAN (ViSwin T ens.)  Average-mAP        77.6          # 2
                                                         Tight Average-mAP  73.1          # 1
Action Spotting  SoccerNet-v2  COMEDIAN (ViSwin T)       Average-mAP        76.6          # 4
                                                         Tight Average-mAP  71.6          # 3
Action Spotting  SoccerNet-v2  COMEDIAN (ViViT T ens.)   Average-mAP        77.1          # 3
                                                         Tight Average-mAP  72.0          # 2
Action Spotting  SoccerNet-v2  COMEDIAN (ViViT T)        Average-mAP        76.1          # 5
                                                         Tight Average-mAP  70.7          # 4

Methods