TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Action Segmentation	Breakfast	BIT	F1@10%	80.6	# 2
Action Segmentation	Breakfast	BIT	F1@50%	64.7	# 2
Action Segmentation	Breakfast	BIT	Acc	75.5	# 6
Action Segmentation	Breakfast	BIT	Edit	79.0	# 1
Action Segmentation	Breakfast	BIT	F1@25%	75.8	# 3
Action Segmentation	GTEA	BIT	F1@10%	94.8	# 2
Action Segmentation	GTEA	BIT	F1@50%	82.6	# 6
Action Segmentation	GTEA	BIT	Acc	82.0	# 4
Action Segmentation	GTEA	BIT	Edit	92.6	# 1
Action Segmentation	GTEA	BIT	F1@25%	92.8	# 2
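
The page does not define the metrics in the table above. The sketch below assumes the definitions conventionally used in the action segmentation literature: Acc as frame-wise accuracy, Edit as a segment-level score based on normalized Levenshtein distance, and F1@k as a segmental F1 score at an IoU threshold of k (0.10, 0.25, 0.50). The function names are hypothetical, and this is not the evaluation code behind the numbers reported here.

```python
def segments(labels):
    """Collapse per-frame labels into (label, start, end) runs; end is exclusive."""
    runs, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            runs.append((labels[start], start, i))
            start = i
    return runs


def frame_accuracy(pred, gt):
    """Acc: percentage of frames whose predicted label matches the ground truth."""
    return 100.0 * sum(p == g for p, g in zip(pred, gt)) / len(gt)


def edit_score(pred, gt):
    """Edit: 100 * (1 - normalized Levenshtein distance between segment label orders).

    Assumes both sequences are non-empty.
    """
    p = [s[0] for s in segments(pred)]
    g = [s[0] for s in segments(gt)]
    prev = list(range(len(g) + 1))
    for i in range(1, len(p) + 1):
        cur = [i] + [0] * len(g)
        for j in range(1, len(g) + 1):
            cost = 0 if p[i - 1] == g[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        prev = cur
    return 100.0 * (1 - prev[-1] / max(len(p), len(g)))


def f1_at_k(pred, gt, k):
    """F1@k: a predicted segment is a true positive if its IoU with the best
    same-label ground-truth segment is >= k and that segment is still unmatched."""
    p_segs, g_segs = segments(pred), segments(gt)
    used = [False] * len(g_segs)
    tp = 0
    for label, ps, pe in p_segs:
        best_iou, best_j = 0.0, -1
        for j, (gl, gs, ge) in enumerate(g_segs):
            if gl != label:
                continue
            inter = max(0, min(pe, ge) - max(ps, gs))
            union = max(pe, ge) - min(ps, gs)
            iou = inter / union
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j >= 0 and best_iou >= k and not used[best_j]:
            tp += 1
            used[best_j] = True
    if tp == 0:
        return 0.0
    precision = tp / len(p_segs)
    recall = tp / len(g_segs)
    return 100.0 * 2 * precision * recall / (precision + recall)


# Toy example: 10 frames, 3 ground-truth segments, one spurious trailing segment.
gt = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
pred = [0, 0, 0, 1, 1, 1, 1, 1, 2, 0]
print(frame_accuracy(pred, gt), edit_score(pred, gt), f1_at_k(pred, gt, 0.50))
```

In the toy example the single spurious segment at the end costs two frames of accuracy but a full edit operation, which illustrates why the segment-level metrics penalize over-segmentation more heavily than Acc does.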

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/bit-bi-level-temporal-modeling-for-efficient/action-segmentation-on-breakfast-1)](https://paperswithcode.com/sota/action-segmentation-on-breakfast-1?p=bit-bi-level-temporal-modeling-for-efficient)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/bit-bi-level-temporal-modeling-for-efficient/action-segmentation-on-gtea-1)](https://paperswithcode.com/sota/action-segmentation-on-gtea-1?p=bit-bi-level-temporal-modeling-for-efficient)`

BIT: Bi-Level Temporal Modeling for Efficient Supervised Action Segmentation

28 Aug 2023 · Zijia Lu, Ehsan Elhamifar

We address the task of supervised action segmentation, which aims to partition a video into non-overlapping segments, each representing a different action. Recent works apply transformers to perform temporal modeling at the frame level, which suffers from high computational cost and cannot capture action dependencies well over long temporal horizons. To address these issues, we propose an efficient BI-level Temporal modeling (BIT) framework that learns explicit action tokens to represent action segments and performs temporal modeling on the frame and action levels in parallel, while maintaining a low computational cost. Our model contains (i) a frame branch that uses convolution to learn frame-level relationships, (ii) an action branch that uses a transformer to learn action-level dependencies with a small set of action tokens, and (iii) cross-attentions to allow communication between the two branches. We apply and extend a set-prediction objective to allow each action token to represent one or multiple action segments, thus avoiding the need to learn a large number of tokens over long videos with many segments. Thanks to the design of our action branch, we can also seamlessly leverage textual transcripts of videos (when available) to help action segmentation by using them to initialize the action tokens. We evaluate our model on four video datasets (two egocentric and two third-person) for action segmentation with and without transcripts, showing that BIT significantly improves the state-of-the-art accuracy with much lower computational cost (30 times faster) compared to existing transformer-based methods.
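
To make the efficiency argument concrete, here is a minimal sketch of the cross-attention step by which a small set of action tokens reads from frame features, together with a rough count of attention pairs. It is plain scaled dot-product attention, not the authors' implementation. The function names, the omission of learned projections, and the cost model in `attention_pairs` (token self-attention plus cross-attention in both directions) are all assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query attends over all keys.

    With action tokens as queries and frame features as keys/values,
    the score matrix has shape (num_tokens, num_frames).
    """
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return out


def attention_pairs(num_frames, num_tokens):
    """Rough count of attention score entries (an assumed cost model).

    frame_self: frame-level self-attention, quadratic in the number of frames.
    bi_level:   token self-attention plus cross-attention in both directions,
                linear in the number of frames.
    """
    frame_self = num_frames * num_frames
    bi_level = num_tokens * num_tokens + 2 * num_frames * num_tokens
    return frame_self, bi_level


# Hypothetical toy input: 3 frames and 1 action token, feature dimension 2.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tokens = [[1.0, 0.0]]
updated_tokens = cross_attention(tokens, frames, frames)
print(updated_tokens)
print(attention_pairs(1000, 10))  # (1000000, 20100)
```

Under this assumed cost model, the number of attention scores grows linearly with video length when the token count is fixed. The 30x speedup quoted in the abstract is the authors' measured result and is not derived from this count.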

Code

No code implementations yet.

Tasks

Action Segmentation

Segmentation

Datasets

Breakfast

GTEA

EgoProceL

Results from the Paper

Ranked #2 on Action Segmentation on Breakfast

Methods

Convolution
