In Visual Query Detection (VQD), a system is given a natural-language query (prompt) and an image, and it must produce 0 to N bounding boxes that satisfy the query. VQD is related to several other computer-vision tasks, but it captures abilities those tasks ignore. Unlike object detection, VQD can handle attributes of, and relations among, objects in the scene. In VQA, algorithms often produce the right answer through dataset bias without 'looking' at the relevant image regions. Referring Expression Recognition (RER) datasets have short, often ambiguous prompts, and because they require only a single box as output, they make it easier to exploit dataset biases. VQD instead requires goal-directed object detection and outputting a variable number of boxes that answer the query.
In VQDv1, the number of bounding boxes per image ranges from 0 to 15. VQDv1 contains 123K images and 621K questions, divided into three categories: 391K Simple Questions, 172K Color Questions, and 58K Positional Reasoning Questions.
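Because a VQD query can have zero ground-truth boxes, an empty prediction can be a correct answer. A minimal sketch of how such variable-length box predictions might be scored, using standard greedy IoU matching (illustrative only, not the official VQDv1 evaluation protocol):

```python
# Illustrative sketch (NOT the official VQDv1 metric): score a VQD-style
# prediction against ground truth by greedy IoU matching. A query may have
# zero ground-truth boxes, so predicting nothing can be a perfect answer.

def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def match_boxes(pred, gt, thresh=0.5):
    """Greedily match predicted boxes to ground-truth boxes at an IoU
    threshold. Returns (true_positives, false_positives, false_negatives)."""
    unmatched_gt = list(gt)
    tp = 0
    for p in pred:
        best = max(unmatched_gt, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= thresh:
            unmatched_gt.remove(best)
            tp += 1
    return tp, len(pred) - tp, len(unmatched_gt)

# Zero-box query: an empty prediction against empty ground truth is correct.
print(match_boxes([], []))                              # (0, 0, 0)
# Overlapping boxes with IoU 0.81 match at the 0.5 threshold.
print(match_boxes([(0, 0, 10, 10)], [(1, 1, 10, 10)]))  # (1, 0, 0)
```

Metrics such as precision and recall can then be aggregated over all queries, which penalizes both spurious boxes on zero-box queries and missed boxes on multi-box queries.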