TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Robot Manipulation Generalization	The COLOSSEUM	RVT	Average decrease average across all perturbations	-16.316	# 1
Robot Manipulation Generalization	The COLOSSEUM	MVP	Average decrease average across all perturbations	-32.352	# 1
Robot Manipulation Generalization	The COLOSSEUM	R3M	Average decrease average across all perturbations	-66.595	# 1
Robot Manipulation Generalization	The COLOSSEUM	PerAct	Average decrease average across all perturbations	-15.526	# 1


THE COLOSSEUM: A Benchmark for Evaluating Generalization for Robotic Manipulation

13 Feb 2024 · Wilbert Pumacay, Ishika Singh, Jiafei Duan, Ranjay Krishna, Jesse Thomason, Dieter Fox

To realize effective large-scale, real-world robotic applications, we must evaluate how well our robot policies adapt to changes in environmental conditions. Unfortunately, a majority of studies evaluate robot performance in environments closely resembling or even identical to the training setup. We present THE COLOSSEUM, a novel simulation benchmark with 20 diverse manipulation tasks that enables systematic evaluation of models across 12 axes of environmental perturbations. These perturbations include changes in color, texture, and size of objects, table-tops, and backgrounds; we also vary lighting, distractors, and camera pose. Using THE COLOSSEUM, we compare 4 state-of-the-art manipulation models and find that their success rate degrades by 30-50% across these perturbation factors. When multiple perturbations are applied in unison, the success rate degrades by $\geq$75%. We identify changes in the number of distractor objects, target object color, and lighting conditions as the perturbations that reduce model performance the most. To verify the ecological validity of our results, we show that our results in simulation are correlated ($\bar{R}^2 = 0.614$) with similar perturbations in real-world experiments. We open-source the code for others to use THE COLOSSEUM, and also release code to 3D print the objects used to replicate the real-world perturbations. Ultimately, we hope that THE COLOSSEUM will serve as a benchmark to identify modeling decisions that systematically improve generalization for manipulation. See https://robot-colosseum.github.io/ for more details.
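As a rough illustration of the degradation figures quoted in the abstract, the sketch below computes a per-perturbation change in success rate relative to an unperturbed baseline and averages it across perturbation axes. It assumes a relative percentage change; the paper's exact metric definition may differ. The function names and all numbers are hypothetical and are not taken from the benchmark or its code.

```python
from statistics import mean


def percent_change(baseline: float, perturbed: float) -> float:
    """Relative change in success rate, in percent (negative means degradation)."""
    if baseline <= 0:
        raise ValueError("baseline success rate must be positive")
    return 100.0 * (perturbed - baseline) / baseline


def average_change(baseline: float, perturbed_by_axis: dict[str, float]) -> float:
    """Mean relative change across all perturbation axes."""
    return mean(percent_change(baseline, p) for p in perturbed_by_axis.values())


# Hypothetical success rates (fractions of successful episodes), for illustration only.
baseline_success = 0.50
perturbed_success = {
    "distractors": 0.25,
    "object_color": 0.30,
    "lighting": 0.35,
}

avg = average_change(baseline_success, perturbed_success)
print(f"average change across perturbations: {avg:.1f}%")
```

Under this assumed definition, a more negative value means the policy loses a larger share of its baseline success rate when the environment is perturbed.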


Code


robot-colosseum/robot-colosseum official

Tasks


Robot Manipulation Generalization

Datasets

Introduced in the Paper:

The COLOSSEUM

Used in the Paper:

RLBench

Results from the Paper


Ranked #1 on Robot Manipulation Generalization on The COLOSSEUM

