We demonstrate the practical interest of our fusion model by using BLOCK for two challenging tasks: Visual Question Answering (VQA) and Visual Relationship Detection (VRD), where we design end-to-end learnable architectures for representing relevant interactions between modalities.
QUESTION ANSWERING REPRESENTATION LEARNING VISUAL QUESTION ANSWERING VISUAL RELATIONSHIP DETECTION
Generating scene graph to describe all the relations inside an image gains increasing interests these years.
Ranked #1 on
Scene Graph Generation
on VRD
GRAPH GENERATION SCENE GRAPH GENERATION VISUAL RELATIONSHIP DETECTION
In this paper, we propose a novel framework, called Deep Structural Ranking, for visual relationship detection.
This requires the detection of visual relationships: triples (subject, relation, object) describing a semantic relation between a subject and an object.
RELATIONAL REASONING TENSOR NETWORKS VISUAL RELATIONSHIP DETECTION ZERO-SHOT LEARNING
To capture such global interdependency, we propose a deep Variation-structured Reinforcement Learning (VRL) framework to sequentially discover object relationships and attributes in the whole image.
We present Open Images V4, a dataset of 9. 2M images with unified annotations for image classification, object detection and visual relationship detection.
IMAGE CLASSIFICATION OBJECT DETECTION VISUAL RELATIONSHIP DETECTION
This paper presents a framework for localization or grounding of phrases in images using a large collection of linguistic and visual cues.
We propose to study the task of Long-Tail Visual Relationship Recognition (LTVRR), which aims at generalizing on the structured long-tail distribution of visual relationships (e. g., "rabbit grazing on grass").
Visual relationship detection, as a challenging task used to find and distinguish the interactions between object pairs in one image, has received much attention recently.
Analyzing the interactions between humans and objects from a video includes identification of the relationships between humans and the objects present in the video.
HUMAN-OBJECT INTERACTION DETECTION VISUAL RELATIONSHIP DETECTION