Harmonious Feature Learning for Interactive Hand-Object Pose Estimation

Joint hand and object pose estimation from a single image is extremely challenging as serious occlusion often occurs when the hand and object interact. Existing approaches typically first extract coarse hand and object features from a single backbone, then further enhance them with reference to each other via interaction modules. However, these works usually ignore that the hand and object are competitive in feature learning, since the backbone takes both of them as foreground and they are usually mutually occluded. In this paper, we propose a novel Harmonious Feature Learning Network (HFL-Net). HFL-Net introduces a new framework that combines the advantages of single- and double-stream backbones: it shares the parameters of the low- and high-level convolutional layers of a common ResNet-50 model for the hand and object, leaving the middle-level layers unshared. This strategy enables the hand and the object to be extracted as the sole targets by the middle-level layers, avoiding their competition in feature learning. The shared high-level layers also force their features to be harmonious, thereby facilitating their mutual feature enhancement. In particular, we propose to enhance the feature of the hand via concatenation with the feature in the same location from the object stream. A subsequent self-attention layer is adopted to deeply fuse the concatenated feature. Experimental results show that our proposed approach consistently outperforms state-of-the-art methods on the popular HO3D and Dex-YCB databases. Notably, the performance of our model on hand pose estimation even surpasses that of existing works that only perform the single-hand pose estimation task. Code is available at https://github.com/lzfff12/HFL-Net.
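The parameter-sharing layout described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation (see the linked repository for that): the `Stage` class, its scalar weights, and the helpers `build_streams`, `run_stream`, and `concat_per_location` are hypothetical stand-ins for groups of ResNet-50 layers, chosen only to show which stages are shared between the hand and object streams. The self-attention fusion step is omitted.

```python
# Toy illustration of shared low/high stages with unshared middle stages.
# All names and numeric values here are hypothetical, not from HFL-Net's code.

class Stage:
    """Stand-in for a group of convolutional layers: scalar affine map + ReLU."""

    def __init__(self, name, weight, bias):
        self.name = name
        self.weight = weight
        self.bias = bias

    def __call__(self, features):
        # Apply the same affine map and ReLU to every spatial location.
        return [max(0.0, self.weight * x + self.bias) for x in features]


def build_streams():
    """Return (hand_stream, object_stream) that share low and high stages."""
    low = Stage("low", 1.0, 0.0)        # shared by both streams
    high = Stage("high", 1.0, -0.05)    # shared by both streams
    hand_stream = [low, Stage("mid_hand", 2.0, 0.0), high]
    object_stream = [low, Stage("mid_object", 0.5, 0.1), high]
    return hand_stream, object_stream


def run_stream(stream, features):
    """Pass features through each stage of a stream in order."""
    # A real network would compute the shared low-level features once;
    # this toy version recomputes them per stream for simplicity.
    for stage in stream:
        features = stage(features)
    return features


def concat_per_location(hand_features, object_features):
    """Pair the hand feature with the object feature at the same location."""
    return [[h, o] for h, o in zip(hand_features, object_features)]


hand_stream, object_stream = build_streams()
image = [0.2, 0.5, 1.0]  # three "locations" of a toy input
hand_features = run_stream(hand_stream, image)
object_features = run_stream(object_stream, image)
fused = concat_per_location(hand_features, object_features)
```

Because the two lists hold the very same `low` and `high` objects, any update to those stages affects both streams, while `mid_hand` and `mid_object` remain independent. In the paper's setting the concatenated features would then pass through a self-attention layer, which this sketch leaves out.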

Results from the Paper


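| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| hand-object pose | DexYCB | HFL-Net | Average MPJPE (mm) | 11.9 | #2 |
| hand-object pose | DexYCB | HFL-Net | Procrustes-Aligned MPJPE | 5.81 | #2 |
| hand-object pose | DexYCB | HFL-Net | OCE | 39.8 | #5 |
| hand-object pose | DexYCB | HFL-Net | MCE | 45.7 | #3 |
| hand-object pose | DexYCB | HFL-Net | ADD-S | 31.9 | #3 |
| 3D Hand Pose Estimation | DexYCB | HFLNet | Average MPJPE (mm) | 12.6 | #3 |
| 3D Hand Pose Estimation | DexYCB | HFLNet | Procrustes-Aligned MPJPE | 5.47 | #2 |
| 3D Hand Pose Estimation | DexYCB | HFLNet | MPVPE | 11.6 | #2 |
| 3D Hand Pose Estimation | DexYCB | HFLNet | VAUC | 77.6 | #2 |
| 3D Hand Pose Estimation | DexYCB | HFLNet | PA-MPVPE | 5.2 | #2 |
| 3D Hand Pose Estimation | DexYCB | HFLNet | PA-VAUC | 89.6 | #2 |
| hand-object pose | HO-3D | HFL-Net | Average MPJPE (mm) | 28.9 | #5 |
| hand-object pose | HO-3D | HFL-Net | ST-MPJPE | 28.4 | #6 |
| hand-object pose | HO-3D | HFL-Net | PA-MPJPE | 8.9 | #1 |
| hand-object pose | HO-3D | HFL-Net | OME | 64.3 | #3 |
| hand-object pose | HO-3D | HFL-Net | ADD-S | 32.4 | #5 |
| 3D Hand Pose Estimation | HO-3D | HFLNet | Average MPJPE (mm) | 28.9 | #8 |
| 3D Hand Pose Estimation | HO-3D | HFLNet | ST-MPJPE (mm) | 28.4 | #12 |
| 3D Hand Pose Estimation | HO-3D | HFLNet | PA-MPJPE (mm) | 8.9 | #3 |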
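Several rows above report MPJPE (mean per-joint position error). As a rough guide to how that number is formed, the sketch below computes the plain, unaligned version: the average Euclidean distance between predicted and ground-truth 3D joints. The function name `mpjpe` and the sample coordinates are illustrative assumptions. The benchmarks' exact evaluation protocols, including any alignment applied for the ST- and PA- variants, are not reproduced here.

```python
import math


def mpjpe(predicted, ground_truth):
    """Mean Euclidean distance between corresponding 3D joints.

    Both arguments are sequences of (x, y, z) tuples in the same unit;
    the result is in that unit (millimetres if the inputs are in mm).
    """
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(math.dist(p, g) for p, g in zip(predicted, ground_truth))
    return total / len(predicted)


# Illustrative two-joint example (made-up coordinates, not benchmark data):
# the per-joint errors are 0 and 5, so the mean is 2.5.
example_error = mpjpe(
    [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)],
    [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
)
```

Lower values are better. The aligned variants in the table first register the prediction to the ground truth and then apply the same per-joint averaging.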
