A disentangled representation encodes information about the salient factors of variation in the data independently.
We present the MAC network, a novel fully differentiable neural network architecture, designed to facilitate explicit and expressive reasoning.
In LXMERT, we build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder.
Ranked #1 on Visual Question Answering on VizWiz
We propose to compose dynamic tree structures that place the objects in an image into a visual context, helping visual reasoning tasks such as scene graph generation and visual Q&A.
Ranked #2 on Scene Graph Generation on Visual Genome
We introduce a general-purpose conditioning method for neural networks called FiLM: Feature-wise Linear Modulation.
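As a rough illustration of feature-wise linear modulation, the sketch below scales and shifts each feature channel by a pair of values (gamma, beta) computed from a conditioning vector. The listing above only names the method, so everything here is an assumption made for illustration: the function names `linear` and `film`, the use of a single affine layer to produce gamma and beta, and the toy weights. It is not the authors' implementation.

```python
# Minimal FiLM-style sketch in plain Python (no external libraries).
# All names and weights below are hypothetical, chosen only for illustration.

def linear(weights, bias, vec):
    """Affine map: `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def film(features, gamma, beta):
    """Apply gamma[c] * x + beta[c] to every activation x in channel c.

    `features` is a list of channels; each channel is a flat list of floats.
    """
    return [[gamma[c] * x + beta[c] for x in channel]
            for c, channel in enumerate(features)]


# Conditioning input (for example, an encoded question); assumed 2-dimensional.
cond = [1.0, 2.0]

# Hypothetical generator parameters mapping the condition to per-channel
# scale (gamma) and shift (beta) values, one pair per feature channel.
w_gamma, b_gamma = [[1.0, 0.0], [0.0, 0.5]], [1.0, 0.0]
w_beta, b_beta = [[0.0, 1.0], [1.0, 0.0]], [0.0, -1.0]

gamma = linear(w_gamma, b_gamma, cond)
beta = linear(w_beta, b_beta, cond)

# Two feature channels with two activations each.
features = [[1.0, 2.0], [3.0, 4.0]]
modulated = film(features, gamma, beta)
print(modulated)
```

In this sketch the conditioning input never touches the activations directly; it only chooses how strongly each channel is scaled and shifted, which is what makes the operation feature-wise.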
Existing methods for visual reasoning attempt to directly map inputs to outputs using black-box architectures without explicitly modeling the underlying reasoning processes.