THE COLOSSEUM: A Benchmark for Evaluating Generalization for Robotic Manipulation

Introduced by Pumacay et al. in THE COLOSSEUM: A Benchmark for Evaluating Generalization for Robotic Manipulation

To realize effective large-scale, real-world robotic applications, we must evaluate how well our robot policies adapt to changes in environmental conditions. Unfortunately, a majority of studies evaluate robot performance in environments closely resembling or even identical to the training setup.

We present THE COLOSSEUM, a novel simulation benchmark with 20 diverse manipulation tasks that enables systematic evaluation of models across 12 axes of environmental perturbation. These perturbations include changes in the color, texture, and size of objects, table-tops, and backgrounds; we also vary lighting, distractors, and camera pose. Using THE COLOSSEUM, we compare four state-of-the-art manipulation models and find that their success rates degrade by 30-50% across these perturbation factors.
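To make the evaluation protocol concrete, here is a minimal sketch of how a perturbed evaluation episode could be sampled across axes of the kind the benchmark varies. All names and value lists below are illustrative assumptions, not THE COLOSSEUM's actual API or axis definitions.

```python
import random

# Hypothetical perturbation axes and candidate values; the real benchmark
# defines 12 axes over object/table/background appearance, lighting,
# distractors, and camera pose.
PERTURBATION_AXES = {
    "object_color":           ["red", "green", "blue", "yellow"],
    "object_texture":         ["plain", "wood", "metal"],
    "object_scale":           [0.75, 1.0, 1.25],
    "table_texture":          ["plain", "checker", "marble"],
    "background_color":       ["white", "gray", "black"],
    "light_intensity":        [0.5, 1.0, 1.5],
    "num_distractors":        [0, 1, 2, 3],
    "camera_pose_jitter_deg": [0.0, 5.0, 10.0],
}

def sample_episode_config(rng=random):
    """Pick one value per perturbation axis for a single evaluation episode."""
    return {axis: rng.choice(values) for axis, values in PERTURBATION_AXES.items()}

config = sample_episode_config()
print(config)
```

Evaluating one axis in isolation (as in the single-perturbation results above) corresponds to fixing every other axis at its default value; the "in unison" setting samples all axes at once.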

When multiple perturbations are applied in unison, the success rate degrades by more than 75%. We identify changes in the number of distractor objects, the color of the target object, and the lighting conditions as the perturbations that reduce model performance the most. To verify the ecological validity of our results, we show that our results in simulation correlate (R² = 0.614) with results under similar perturbations in real-world experiments. We open-source the code for others to use THE COLOSSEUM, and also release the code to 3D print the objects used to replicate the real-world perturbations. Ultimately, we hope that THE COLOSSEUM will serve as a benchmark for identifying modeling decisions that systematically improve generalization for manipulation.
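The reported sim-to-real agreement is a coefficient of determination over per-perturbation success rates. A minimal sketch of that computation, using made-up success rates rather than the authors' data:

```python
def r_squared(xs, ys):
    """R^2 of a simple least-squares line fit of ys on xs.

    For simple linear regression, R^2 equals the squared Pearson
    correlation between the two variables.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return (cov * cov) / (var_x * var_y)

# Made-up per-perturbation success rates (fractions), sim vs. real.
sim  = [0.62, 0.45, 0.30, 0.55, 0.40]
real = [0.58, 0.40, 0.35, 0.50, 0.33]

print(r_squared(sim, real))
```

A value near 1 means the simulated degradation under each perturbation predicts the real-world degradation well; the paper's R² = 0.614 indicates a substantial but imperfect correspondence.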

License


  • Unknown
