Modality and Component Aware Feature Fusion For RGB-D Scene Classification

CVPR 2016  ·  Anran Wang, Jianfei Cai, Jiwen Lu, Tat-Jen Cham

While convolutional neural networks (CNNs) have been excellent for object recognition, the greater spatial variability in scene images typically means that standard full-image CNN features are suboptimal for scene classification. In this paper, we investigate a framework allowing greater spatial flexibility, in which the Fisher vector (FV) encoded distribution of local CNN features, obtained from a multitude of region proposals per image, is considered instead. The CNN features are computed from an augmented pixel-wise representation comprising multiple modalities of RGB, HHA and surface normals, as extracted from RGB-D data. More significantly, we make two postulates: (1) component sparsity, that only a small variety of region proposals and their corresponding FV GMM components contribute to scene discriminability, and (2) modal non-sparsity, that within these discriminative components, all modalities make important contributions. In our framework, these are implemented through regularization terms that apply group lasso to GMM components and exclusive group lasso across modalities. By learning and combining regressors for both proposal-based FV features and global CNN features, we achieve state-of-the-art scene classification performance on the SUN RGB-D and NYU Depth v2 datasets.
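To make the two regularizers concrete, below is a minimal NumPy sketch of how a component-wise group lasso and a modality-wise exclusive-lasso-style penalty could be computed over a classifier weight tensor. The function name, the (components × modalities × dimensions) layout, and the exact exclusive-lasso form (squared per-modality l1 norms) are illustrative assumptions for exposition, not the paper's implementation.

```python
import numpy as np

def modality_component_regularizer(W, lam_comp, lam_modal):
    """Illustrative combined regularizer (hypothetical form).

    W : array of shape (K, M, D) -- weights for K GMM components,
        M modalities, and D dimensions per (component, modality) block.
    """
    K = W.shape[0]
    # Component sparsity: group lasso over each component's full
    # (M * D)-dim weight block; summing unsquared l2 norms drives
    # entire non-discriminative components to zero.
    group_lasso = np.linalg.norm(W.reshape(K, -1), axis=1).sum()

    # Modal non-sparsity: exclusive-lasso-style term; squaring each
    # per-(component, modality) l1 norm makes concentrating weight
    # in a single modality costlier than spreading it across all
    # modalities of a surviving component.
    exclusive = (np.abs(W).sum(axis=2) ** 2).sum()

    return lam_comp * group_lasso + lam_modal * exclusive

# Toy usage: 3 GMM components, 2 modalities (e.g. RGB and HHA), 4 dims each.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2, 4))
print(modality_component_regularizer(W, lam_comp=0.1, lam_modal=0.01))
```

The design intuition: the unsquared l2 sum behaves like an l1 penalty at the component level (whole-group sparsity), while the squared l1 sum rewards distributing mass across modality groups, matching the component-sparsity and modal non-sparsity postulates above.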
