Efficient Multi-Task Scene Analysis with RGB-D Transformers

Scene analysis is essential for enabling autonomous systems, such as mobile robots, to operate in real-world environments. However, obtaining a comprehensive understanding of a scene requires solving multiple tasks, such as panoptic segmentation, instance orientation estimation, and scene classification. Solving these tasks given the limited computing and battery capabilities of mobile platforms is challenging. To address this challenge, we introduce an efficient multi-task scene analysis approach, called EMSAFormer, that uses an RGB-D Transformer-based encoder to simultaneously perform the aforementioned tasks. Our approach builds upon the previously published EMSANet. However, we show that the dual CNN-based encoder of EMSANet can be replaced with a single Transformer-based encoder. To achieve this, we investigate how information from both RGB and depth data can be effectively incorporated into a single encoder. To accelerate inference on robotic hardware, we provide a custom NVIDIA TensorRT extension that enables highly optimized inference for our EMSAFormer approach. Through extensive experiments on the commonly used indoor datasets NYUv2, SUNRGB-D, and ScanNet, we show that our approach achieves state-of-the-art performance while still enabling inference at up to 39.1 FPS on an NVIDIA Jetson AGX Orin 32 GB.
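To make the single-encoder multi-task design described in the abstract more concrete, the following is a minimal PyTorch-style sketch. It assumes early fusion of depth as an additional input channel and simple task-specific heads on top of a shared backbone; the class and layer names (MultiTaskRGBDModel, semantic_head, etc.) and all shapes are hypothetical illustrations, not the actual EMSAFormer implementation.

import torch
import torch.nn as nn

class MultiTaskRGBDModel(nn.Module):
    """Illustrative sketch: one shared encoder, several task-specific heads.
    This is NOT the EMSAFormer code; names and dimensions are assumptions."""

    def __init__(self, encoder, feat_dim=768, num_classes=40, num_scene_classes=10):
        super().__init__()
        # shared backbone, e.g. a Swin-style Transformer taking a 4-channel input
        self.encoder = encoder
        # task-specific heads operating on the shared feature map
        self.semantic_head = nn.Conv2d(feat_dim, num_classes, kernel_size=1)
        self.instance_head = nn.Conv2d(feat_dim, 2, kernel_size=1)     # e.g. center/offset
        self.orientation_head = nn.Conv2d(feat_dim, 2, kernel_size=1)  # e.g. sin/cos of yaw
        self.scene_head = nn.Linear(feat_dim, num_scene_classes)

    def forward(self, rgb, depth):
        # early fusion: append depth as a fourth input channel
        x = torch.cat([rgb, depth], dim=1)   # (B, 4, H, W)
        feats = self.encoder(x)              # (B, feat_dim, H', W')
        return {
            "semantic": self.semantic_head(feats),
            "instance": self.instance_head(feats),
            "orientation": self.orientation_head(feats),
            "scene": self.scene_head(feats.mean(dim=(2, 3))),
        }

Because all heads share one encoder forward pass, the per-frame cost is dominated by the backbone, which is what makes exporting that backbone to an optimized runtime such as TensorRT worthwhile on embedded hardware.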


Results from the Paper


Task                  | Dataset      | Model                                                        | Metric   | Value  | Global Rank
Semantic Segmentation | NYU Depth v2 | EMSAFormer (SwinV2-T-128-Multi-Aug)                          | Mean IoU | 51.26% | #40
Semantic Segmentation | ScanNetV2    | EMSAFormer                                                   | Mean IoU | 56.4%  | #5
Semantic Segmentation | SUN-RGBD     | EMSANet (2x ResNet-34 NBt1D, PanopticNDT version, finetuned) | Mean IoU | 48.82% | #18
