Cognition Guided Human-Object Relationship Detection

Human-object relationship detection reveals the fine-grained relationships between humans and objects, aiding the comprehensive understanding of videos. Previous human-object relationship detection approaches are mainly built on object features and relation features, without exploiting information specific to humans. In this paper, we propose a novel Relation-Pose Transformer (RPT) for human-object relationship detection. Inspired by the coordination of eye-head-body movements in cognitive science, we employ the head pose to locate the crucial objects that humans focus on, and the body pose, with skeleton information, to represent multiple actions. A spatial encoder then captures spatially contextualized information about each relation pair by integrating the relation features and pose features, and a temporal decoder models the temporal dependencies of the relationship. Finally, multiple classifiers predict the different types of relationships. Extensive experiments on the Action Genome benchmark validate the effectiveness of the proposed method and show state-of-the-art performance compared with related methods.
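
The abstract outlines a pipeline of pose-aware feature fusion, a spatial encoder, a temporal decoder, and per-type classifiers. Below is a minimal PyTorch sketch of that pipeline under stated assumptions: the module names (RPTSketch, fuse, attention_head, etc.), feature dimensions, and the concatenation-based fusion of relation, head-pose, and body-pose features are all hypothetical, since the paper does not publish this interface; the temporal decoder is approximated with self-attention across frames; the default class counts follow Action Genome's attention/spatial/contacting relationship annotations.

import torch
import torch.nn as nn


class RPTSketch(nn.Module):
    """Minimal sketch of a Relation-Pose-Transformer-style pipeline.

    Module names, feature dimensions, and the concatenation-based fusion
    are illustrative assumptions, not the paper's published interface.
    """

    def __init__(self, d_model=512, n_heads=8, n_layers=2,
                 n_attention_cls=3, n_spatial_cls=6, n_contacting_cls=17):
        super().__init__()
        # Fuse relation features with head-pose and body-pose features
        # into a single token per human-object pair (assumed scheme).
        self.fuse = nn.Linear(3 * d_model, d_model)
        # Spatial encoder: self-attention over all relation pairs of a frame.
        self.spatial_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            n_layers)
        # Temporal stage (the paper's "temporal decoder"), approximated here
        # with self-attention across frames for the same pair.
        self.temporal_decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            n_layers)
        # One classifier per relationship type; the default class counts
        # follow Action Genome's attention/spatial/contacting annotations.
        self.attention_head = nn.Linear(d_model, n_attention_cls)
        self.spatial_head = nn.Linear(d_model, n_spatial_cls)
        self.contacting_head = nn.Linear(d_model, n_contacting_cls)

    def forward(self, relation_feats, head_pose_feats, body_pose_feats):
        # Each input: (T, P, d_model) for T frames and P human-object pairs.
        x = self.fuse(torch.cat(
            [relation_feats, head_pose_feats, body_pose_feats], dim=-1))
        x = self.spatial_encoder(x)  # mix pairs within each frame
        # Transpose so attention runs over the time axis for each pair.
        x = self.temporal_decoder(x.transpose(0, 1)).transpose(0, 1)
        return (self.attention_head(x),
                self.spatial_head(x),
                self.contacting_head(x))


# Toy usage: 8 frames, 4 tracked human-object pairs, 512-d features.
T, P, d = 8, 4, 512
model = RPTSketch(d_model=d)
logits = model(torch.randn(T, P, d), torch.randn(T, P, d), torch.randn(T, P, d))
print([t.shape for t in logits])  # three (T, P, n_cls) logit tensors

Separating the spatial and temporal attention stages keeps the cost linear in each axis rather than attending jointly over all frame-pair tokens; this factorization is the design the abstract's encoder/decoder split suggests, though the exact attention layout inside RPT is an assumption here.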


Datasets

Action Genome

Methods

Relation-Pose Transformer (RPT)