Video Visual Relation Detection

7 papers with code • 2 benchmarks • 2 datasets

Video Visual Relation Detection (VidVRD) aims to detect instances of visual relations of interest in a video, where a visual relation instance is represented by a relation triplet <subject, predicate, object> together with the trajectories of the subject and object. Compared with still images, videos provide a more natural set of features for detecting visual relations, such as dynamic relations like “A-follow-B” and “A-towards-B”, and temporally changing relations like “A-chase-B” followed by “A-hold-B”. Yet VidVRD is technically more challenging than image visual relation detection (ImgVRD) because accurate object tracking is difficult and relation appearances are more diverse in the video domain.

Source: ImageNet-VidVRD Video Visual Relation Dataset
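The definition above can be made concrete with a small data structure. The following is a minimal sketch, not the format of any particular dataset or codebase: the class names `Trajectory` and `RelationInstance`, the `(x_min, y_min, x_max, y_max)` box layout, and the frame-indexed dictionary are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Trajectory:
    """Hypothetical container: one tracked object as frame index -> box."""
    category: str
    boxes: Dict[int, Box]


@dataclass
class RelationInstance:
    """A <subject, predicate, object> triplet with both trajectories."""
    subject: Trajectory
    predicate: str
    obj: Trajectory

    def triplet(self) -> Tuple[str, str, str]:
        # The category-level triplet, without spatial or temporal detail.
        return (self.subject.category, self.predicate, self.obj.category)

    def shared_frames(self) -> List[int]:
        # Frames in which both subject and object are tracked; a relation
        # can only be localized where the two trajectories overlap in time.
        return sorted(self.subject.boxes.keys() & self.obj.boxes.keys())


# Toy example: a dog tracked in frames 0-4, a person in frames 2-7.
dog = Trajectory("dog", {t: (10.0 + t, 20.0, 50.0 + t, 60.0) for t in range(0, 5)})
person = Trajectory("person", {t: (100.0, 20.0, 140.0, 90.0) for t in range(2, 8)})
rel = RelationInstance(subject=dog, predicate="follow", obj=person)

print(rel.triplet())
print(rel.shared_frames())
```

Storing boxes per frame, rather than a single box per object, is what distinguishes this representation from the image case: the same pair of objects can carry different predicates over different frame ranges.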

Most implemented papers

Video Relationship Reasoning using Gated Spatio-Temporal Energy Graph

yaohungt/GSTEG_CVPR_2019 CVPR 2019

Visual relationship reasoning is a crucial yet challenging task for understanding rich interactions across visual concepts.

LIGHTEN: Learning Interactions with Graph and Hierarchical TEmporal Networks for HOI in videos

praneeth11009/LIGHTEN-Learning-Interactions-with-Graphs-and-Hierarchical-TEmporal-Networks-for-HOI 17 Dec 2020

Analyzing the interactions between humans and objects from a video includes identification of the relationships between humans and the objects present in the video.

What and When to Look?: Temporal Span Proposal Network for Video Relation Detection

sangminwoo/Temporal-Span-Proposal-Network-VidVRD 15 Jul 2021

TSPN tells when to look: it simultaneously predicts start-end timestamps (i.e., temporal spans) and categories of all possible relations by utilizing full video context.

Spatial-Temporal Transformer for Dynamic Scene Graph Generation

yrcong/sttran ICCV 2021

Compared to the task of scene graph generation from images, dynamic scene graph generation is more challenging because of the dynamic relationships between objects and the temporal dependencies between frames, which allow for a richer semantic interpretation.

Social Fabric: Tubelet Compositions for Video Relation Detection

shanshuo/social-fabric ICCV 2021

We also propose Social Fabric: an encoding that represents a pair of object tubelets as a composition of interaction primitives.

Video Relation Detection via Tracklet based Visual Transformer

dawn-lx/vidvrd-tracklets 19 Aug 2021

Video Visual Relation Detection (VidVRD) has received significant attention from our community over recent years.

Compositional Prompt Tuning with Motion Cues for Open-vocabulary Video Relation Detection

dawn-lx/openvoc-vidvrd 1 Feb 2023

Without bells and whistles, our RePro achieves a new state-of-the-art performance on two VidVRD benchmarks of not only the base training object and predicate categories, but also the unseen ones.