4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks

In many robotics and VR/AR applications, 3D-videos are readily available sources of input (continuous sequences of depth images or LIDAR scans). However, these 3D-videos are processed frame-by-frame, either through 2D convnets or 3D perception algorithms. In this work, we propose 4-dimensional convolutional neural networks for spatio-temporal perception that can directly process such 3D-videos using high-dimensional convolutions. For this, we adopt sparse tensors and propose the generalized sparse convolution, which encompasses all discrete convolutions. To implement the generalized sparse convolution, we create an open-source auto-differentiation library for sparse tensors that provides extensive functions for high-dimensional convolutional neural networks. We create 4D spatio-temporal convolutional neural networks using the library and validate them on various 3D semantic segmentation benchmarks and on proposed 4D datasets for 3D-video perception. To overcome challenges in the 4D space, we propose the hybrid kernel, a special case of the generalized sparse convolution, and the trilateral-stationary conditional random field, which enforces spatio-temporal consistency in the 7D space-time-chroma space. Experimentally, we show that convolutional neural networks with only generalized 3D sparse convolutions can outperform 2D or 2D-3D hybrid methods by a large margin. We also show that, on 3D-videos, 4D spatio-temporal convolutional neural networks are robust to noise, outperform 3D convolutional neural networks, and are faster than their 3D counterparts in some cases.
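
To make the idea of a sparse convolution over 4D coordinates concrete, here is a minimal sketch in plain Python. It stores a sparse tensor as a dictionary from integer (x, y, z, t) coordinates to scalar features and accumulates weighted neighbors only where input sites exist. This is not the API of the authors' library: the names `hypercube_offsets`, `hybrid_offsets`, and `sparse_conv` are hypothetical, features are scalars rather than channel vectors, and the hybrid kernel layout used here (cubic in space, cross-shaped in time) is an assumption about how that kernel is shaped.

```python
from itertools import product


def hypercube_offsets(dim, size=3):
    """All offsets of a dense size**dim kernel centered at the origin."""
    r = size // 2
    return list(product(range(-r, r + 1), repeat=dim))


def hybrid_offsets(spatial_dim=3, size=3):
    """Assumed hybrid layout: full cube in space at t = 0, plus the
    spatial center at the neighboring time steps (a cross in time)."""
    r = size // 2
    center = (0,) * spatial_dim
    spatial = [s + (0,) for s in hypercube_offsets(spatial_dim, size)]
    temporal = [center + (t,) for t in range(-r, r + 1) if t != 0]
    return spatial + temporal


def sparse_conv(features, weights, out_coords=None):
    """Scalar-feature sparse convolution.

    features:   dict {coordinate tuple: float} holding only occupied sites
    weights:    dict {kernel offset tuple: float}
    out_coords: coordinates to evaluate (defaults to the input coordinates)
    """
    if out_coords is None:
        out_coords = list(features)
    out = {}
    for u in out_coords:
        acc = 0.0
        for offset, w in weights.items():
            v = tuple(a + b for a, b in zip(u, offset))
            if v in features:  # empty space contributes nothing
                acc += w * features[v]
        out[u] = acc
    return out


# Four occupied sites in (x, y, z, t); everything else is empty space.
features = {
    (0, 0, 0, 0): 1.0,
    (3, 0, 0, 0): 2.0,
    (0, 0, 0, 1): 4.0,
    (1, 0, 0, 1): 8.0,
}

full_kernel = {o: 1.0 for o in hypercube_offsets(4)}   # 3^4 = 81 offsets
hybrid_kernel = {o: 1.0 for o in hybrid_offsets()}     # 27 + 2 = 29 offsets

full_out = sparse_conv(features, full_kernel)
hybrid_out = sparse_conv(features, hybrid_kernel)
```

With all weights set to 1, each output is the sum of the occupied neighbors the kernel can reach. The hybrid kernel has 29 offsets instead of 81, so it skips diagonal space-time neighbors such as (1, 0, 0, 1) when evaluated at the origin, which is the kind of parameter and compute saving the hybrid kernel is meant to provide.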

Venue: CVPR 2019
| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Robust 3D Semantic Segmentation | nuScenes-C | MinkUNet-34 | mean Corruption Error (mCE) | 96.37% | #2 |
| Robust 3D Semantic Segmentation | nuScenes-C | MinkUNet-18 | mean Corruption Error (mCE) | 100.00% | #5 |
| Semantic Segmentation | S3DIS | MinkowskiNet | Mean IoU | 65.4 | #35 |
| Semantic Segmentation | S3DIS | MinkowskiNet | Number of params | 37.9M | #50 |
| Semantic Segmentation | S3DIS | MinkowskiNet | Params (M) | 37.9 | #3 |
| Semantic Segmentation | S3DIS Area5 | MinkowskiNet | mIoU | 65.4 | #36 |
| Semantic Segmentation | S3DIS Area5 | MinkowskiNet | mAcc | 71.7 | #29 |
| Semantic Segmentation | S3DIS Area5 | MinkowskiNet | Number of params | 37.9M | #52 |
| Semantic Segmentation | ScanNet | MinkowskiNet | test mIoU | 73.4 | #15 |
| Semantic Segmentation | ScanNet | MinkowskiNet | val mIoU | 72.2 | #17 |
| 3D Semantic Segmentation | ScanNet++ | MinkowskiNet | Top-1 IoU | 0.292 | #2 |
| 3D Semantic Segmentation | ScanNet200 | MinkUNet | val mIoU | 25.0 | #9 |
| 3D Semantic Segmentation | ScanNet200 | MinkUNet | test mIoU | 25.3 | #7 |
| 3D Semantic Segmentation | ScribbleKITTI | MinkowskiNet | mIoU | 55.0 | #3 |
| Robust 3D Semantic Segmentation | SemanticKITTI-C | MinkUNet-34 | mean Corruption Error (mCE) | 100.61% | #5 |
| Robust 3D Semantic Segmentation | SemanticKITTI-C | MinkUNet-18 | mean Corruption Error (mCE) | 100.00% | #3 |
| Robust 3D Semantic Segmentation | WOD-C | MinkUNet-18 | mean Corruption Error (mCE) | 100.00% | #3 |
| Robust 3D Semantic Segmentation | WOD-C | MinkUNet-34 | mean Corruption Error (mCE) | 96.21% | #1 |
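
For readers unfamiliar with the mIoU and mAcc metrics reported above, the following sketch shows how they are commonly computed from a per-class confusion matrix. It assumes the usual definitions (per-class intersection over union and per-class recall, each averaged over classes); individual benchmarks may treat absent or ignored classes differently, and the function names and the toy matrix are hypothetical.

```python
def mean_iou(confusion):
    """confusion[i][j] = number of points with true class i predicted as class j."""
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                          # true c, predicted other
        fp = sum(confusion[r][c] for r in range(n)) - tp     # predicted c, true other
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both labels and predictions
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


def mean_accuracy(confusion):
    """Average of per-class recall over classes that appear in the labels."""
    recalls = []
    for c, row in enumerate(confusion):
        total = sum(row)
        if total > 0:
            recalls.append(row[c] / total)
    return sum(recalls) / len(recalls) if recalls else 0.0


# Toy two-class example: rows are ground truth, columns are predictions.
confusion = [
    [8, 2],
    [1, 9],
]
miou = mean_iou(confusion)       # (8/11 + 9/12) / 2
macc = mean_accuracy(confusion)  # (8/10 + 9/10) / 2
```

Scores such as the 65.4 mIoU listed for S3DIS are this kind of average expressed as a percentage.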

Methods