Natural language questions are inherently compositional, and many are most easily answered by reasoning about their decomposition into modular sub-problems.
#16 best model for Visual Question Answering on VQA v2 test-dev
Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance.
3D INSTANCE SEGMENTATION, HUMAN PART SEGMENTATION, KEYPOINT DETECTION, MULTI-HUMAN PARSING, MULTI-PERSON POSE ESTIMATION, MULTI-TISSUE NUCLEUS SEGMENTATION, NUCLEAR SEGMENTATION, PANOPTIC SEGMENTATION, REAL-TIME OBJECT DETECTION
Our novel Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training.
#3 best model for Dense Object Detection on SKU-110K
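The Focal Loss summarized above is usually written as FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. The following is a minimal sketch of that formula for a single binary prediction, not the authors' implementation. The function names are hypothetical, and the defaults gamma=2.0 and alpha=0.25 are assumptions based on commonly reported settings; the excerpt itself does not state them.

```python
import math


def cross_entropy(p, y):
    """Standard binary cross-entropy for one prediction.

    p: predicted probability of the positive class, strictly between 0 and 1
    y: ground-truth label, 1 (object) or 0 (background)
    """
    p_t = p if y == 1 else 1.0 - p  # probability assigned to the true class
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction (illustrative sketch).

    gamma and alpha defaults are assumed values, not taken from the excerpt.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    # (1 - p_t) ** gamma shrinks the loss of well-classified examples
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy negative: background correctly predicted with high confidence.
easy_negative = focal_loss(0.01, 0)
# A hard positive: an object the model assigns low probability.
hard_positive = focal_loss(0.10, 1)
```

With gamma set to 0 the modulating factor disappears and the loss reduces to alpha-weighted cross-entropy; larger gamma values push the contribution of easy examples closer to zero, which is the down-weighting effect the abstract describes.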
We propose an attentive local feature descriptor suitable for large-scale image retrieval, referred to as DELF (DEep Local Feature).
#2 best model for Image Retrieval on Oxf5k
Image-to-image translation is a class of vision and graphics problems where the goal is to learn the mapping between an input image and an output image using a training set of aligned image pairs.
#2 best model for Multimodal Unsupervised Image-To-Image Translation on EPFL NIR-VIS
Depending on the task complexity, thousands to millions of labeled image pairs are needed to train a conditional GAN.
#2 best model for Image-to-Image Translation on Aerial-to-Map
Our CNN works with just a single 2D facial image, requires neither accurate alignment nor dense correspondence between images, works for arbitrary facial poses and expressions, and can reconstruct the whole 3D facial geometry (including the non-visible parts of the face), bypassing the construction (during training) and fitting (during testing) of a 3D Morphable Model.
#2 best model for 3D Face Reconstruction on Florence
In this paper, we propose a novel regional multi-person pose estimation (RMPE) framework to facilitate pose estimation in the presence of inaccurate human bounding boxes.
To this end, we make the following 5 contributions: (a) we construct, for the first time, a very strong baseline by combining a state-of-the-art architecture for landmark localization with a state-of-the-art residual block, train it on a very large yet synthetically expanded 2D facial landmark dataset, and finally evaluate it on all other 2D facial landmark datasets.
SOTA for Face Alignment on 300-VW (C)