Vision and Language Pre-Trained Models

OSCAR (Object-Semantics Aligned Pre-training) is a vision-language pre-training method that uses object tags detected in images as anchor points to ease the learning of image-text alignments. The model takes a triple (word tokens, object tags, image regions) as input and is pre-trained with two losses: a masked token loss over the words and tags, and a contrastive loss that distinguishes the original tag sequence from a corrupted one. The object tags serve as anchor points that align image regions with the word embeddings of a pre-trained language model (obtained via dictionary look-up), so that an image-text pair is represented in a shared semantic space. The pre-trained model is then fine-tuned for both understanding and generation tasks.

Source: Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
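The pre-training setup described above can be sketched in plain Python. This is a minimal illustration of how the (word, tag, region) triple is assembled into one input sequence, how text tokens are masked for the masked token loss, and how tag sequences are "polluted" for the contrastive loss; function names such as `build_oscar_input` and `pollute_tags`, and the string stand-ins for region features, are illustrative assumptions, not the actual OSCAR implementation.

```python
import random

random.seed(0)

def build_oscar_input(words, tags, regions, mask_prob=0.15, mask_token="[MASK]"):
    """Assemble the (word, tag, region) triple into one input sequence
    and randomly mask text tokens for the masked token loss."""
    # Text side: caption word tokens followed by the detected object tags.
    text = ["[CLS]"] + words + ["[SEP]"] + tags + ["[SEP]"]
    # Mask a fraction of the text tokens (words and tags), BERT-style.
    masked, labels = [], []
    for tok in text:
        if tok not in ("[CLS]", "[SEP]") and random.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)   # the model must recover this token
        else:
            masked.append(tok)
            labels.append(None)  # not part of the masked token loss
    # Visual side: region features are appended after the text sequence.
    return masked + regions, labels

def pollute_tags(tags, tag_vocab, pollute_prob=0.5):
    """For the contrastive loss: with some probability replace the tag
    sequence with tags sampled from other images (label 0 = polluted,
    label 1 = original); the model predicts this label from [CLS]."""
    if random.random() < pollute_prob:
        return random.sample(tag_vocab, len(tags)), 0
    return tags, 1

words = ["a", "dog", "on", "a", "couch"]
tags = ["dog", "couch"]
regions = ["<region0>", "<region1>"]  # stand-ins for RoI feature vectors

seq, labels = build_oscar_input(words, tags, regions)
tags2, label = pollute_tags(tags, ["cat", "table", "car", "tree"])
```

In the real model the region entries are continuous visual features rather than tokens, and both losses are computed by a BERT-style transformer over the concatenated sequence; the sketch only shows the data layout.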

Tasks


Task                               Papers  Share
Language Modelling                 4       11.43%
Image Captioning                   3       8.57%
Visual Question Answering (VQA)    3       8.57%
Question Answering                 2       5.71%
NER                                2       5.71%
Change Detection                   1       2.86%
GPT-4                              1       2.86%
Image Generation                   1       2.86%
Text-to-Image Generation           1       2.86%
