Video Visual Relation Detection
7 papers with code • 2 benchmarks • 2 datasets
Video Visual Relation Detection (VidVRD) aims to detect instances of visual relations of interest in a video, where each visual relation instance is represented by a relation triplet <subject, predicate, object> together with the trajectories of the subject and object. Compared to still images, videos provide a more natural set of features for detecting visual relations, including dynamic relations such as “A-follow-B” and “A-towards-B”, and temporally changing relations such as “A-chase-B” followed by “A-hold-B”. However, VidVRD is technically more challenging than image visual relation detection (ImgVRD) because accurate object tracking is difficult and relation appearances are more diverse in the video domain.
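To make the task definition concrete, here is a minimal sketch of how one relation instance (a triplet plus the two trajectories) might be represented. The class name `RelationInstance`, its field names, the box format, and the frame-index convention are all assumptions made for illustration; they are not taken from any VidVRD dataset or codebase.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed box format for this sketch: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class RelationInstance:
    """One <subject, predicate, object> triplet with per-frame trajectories."""
    subject: str
    predicate: str
    obj: str
    start_frame: int  # inclusive (assumed convention)
    end_frame: int    # exclusive (assumed convention)
    subject_trajectory: List[Box]
    object_trajectory: List[Box]

    def __post_init__(self) -> None:
        length = self.end_frame - self.start_frame
        if length <= 0:
            raise ValueError("end_frame must be greater than start_frame")
        # Each trajectory must supply exactly one box per frame in the span.
        if (len(self.subject_trajectory) != length
                or len(self.object_trajectory) != length):
            raise ValueError("each trajectory needs one box per frame")

    @property
    def triplet(self) -> Tuple[str, str, str]:
        return (self.subject, self.predicate, self.obj)


def static_track(box: Box, num_frames: int) -> List[Box]:
    """Toy trajectory: the same box repeated for every frame."""
    return [box] * num_frames


# A temporally changing relation: "chase" followed by "hold".
chase = RelationInstance(
    "dog", "chase", "ball", 0, 30,
    static_track((10.0, 20.0, 60.0, 80.0), 30),
    static_track((200.0, 50.0, 220.0, 70.0), 30),
)
hold = RelationInstance(
    "dog", "hold", "ball", 30, 45,
    static_track((150.0, 30.0, 210.0, 90.0), 15),
    static_track((190.0, 50.0, 210.0, 70.0), 15),
)

# Order the instances by when they start in the video.
timeline = sorted([hold, chase], key=lambda r: r.start_frame)
print([r.triplet for r in timeline])
# [('dog', 'chase', 'ball'), ('dog', 'hold', 'ball')]
```

Keeping the temporal span and both trajectories on the instance reflects the point above that the same subject-object pair can carry different predicates over different parts of the video.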
Most implemented papers
Video Relationship Reasoning using Gated Spatio-Temporal Energy Graph
Visual relationship reasoning is a crucial yet challenging task for understanding rich interactions across visual concepts.
LIGHTEN: Learning Interactions with Graph and Hierarchical TEmporal Networks for HOI in videos
Analyzing the interactions between humans and objects from a video includes identification of the relationships between humans and the objects present in the video.
What and When to Look?: Temporal Span Proposal Network for Video Relation Detection
TSPN tells when to look: it simultaneously predicts start-end timestamps (i.e., temporal spans) and categories of all possible relations by utilizing full video context.
Spatial-Temporal Transformer for Dynamic Scene Graph Generation
Compared to the task of scene graph generation from images, dynamic scene graph generation from video is more challenging because of the dynamic relationships between objects and the temporal dependencies between frames, which allow for a richer semantic interpretation.
Social Fabric: Tubelet Compositions for Video Relation Detection
We also propose Social Fabric: an encoding that represents a pair of object tubelets as a composition of interaction primitives.
Video Relation Detection via Tracklet based Visual Transformer
Video Visual Relation Detection (VidVRD) has received significant attention from our community over recent years.
Compositional Prompt Tuning with Motion Cues for Open-vocabulary Video Relation Detection
Without bells and whistles, our RePro achieves new state-of-the-art performance on two VidVRD benchmarks, not only on the base training object and predicate categories but also on the unseen ones.