Vision and Language Pre-Trained Models

OSCAR (Object-Semantics Aligned Pre-training) is a vision-language pre-training method that uses object tags detected in images as anchor points to ease the learning of image-text alignments. The model takes a triple (word tokens, object tags, image regions) as input and is pre-trained with two losses: a masked token loss over the words and tags, and a contrastive loss that distinguishes the original tag sequence from a corrupted one. The object tags serve as anchor points that align image regions with the word embeddings of a pre-trained language model (obtained via dictionary look-up), so that an image-text pair is represented in a shared semantic space. The pre-trained model is then fine-tuned for both understanding and generation tasks.

Source: Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
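The pre-training setup described above can be sketched in plain Python. This is a minimal illustration of how the (word, tag, region) triple is assembled into one input sequence, how text tokens are masked for the masked token loss, and how tag sequences are "polluted" for the contrastive loss; function names such as `build_oscar_input` and `pollute_tags`, and the string stand-ins for region features, are illustrative assumptions, not the actual OSCAR implementation.

```python
import random

random.seed(0)

def build_oscar_input(words, tags, regions, mask_prob=0.15, mask_token="[MASK]"):
    """Assemble the (word, tag, region) triple into one input sequence
    and randomly mask text tokens for the masked token loss."""
    # Text side: caption word tokens followed by the detected object tags.
    text = ["[CLS]"] + words + ["[SEP]"] + tags + ["[SEP]"]
    # Mask a fraction of the text tokens (words and tags), BERT-style.
    masked, labels = [], []
    for tok in text:
        if tok not in ("[CLS]", "[SEP]") and random.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)   # the model must recover this token
        else:
            masked.append(tok)
            labels.append(None)  # not part of the masked token loss
    # Visual side: region features are appended after the text sequence.
    return masked + regions, labels

def pollute_tags(tags, tag_vocab, pollute_prob=0.5):
    """For the contrastive loss: with some probability replace the tag
    sequence with tags sampled from other images (label 0 = polluted,
    label 1 = original); the model predicts this label from [CLS]."""
    if random.random() < pollute_prob:
        return random.sample(tag_vocab, len(tags)), 0
    return tags, 1

words = ["a", "dog", "on", "a", "couch"]
tags = ["dog", "couch"]
regions = ["<region0>", "<region1>"]  # stand-ins for RoI feature vectors

seq, labels = build_oscar_input(words, tags, regions)
tags2, label = pollute_tags(tags, ["cat", "table", "car", "tree"])
```

In the real model the region entries are continuous visual features rather than tokens, and both losses are computed by a BERT-style transformer over the concatenated sequence; the sketch only shows the data layout.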

Tasks


Task                               Papers  Share
Language Modelling                 4       11.43%
Image Captioning                   3       8.57%
Visual Question Answering (VQA)    3       8.57%
Question Answering                 2       5.71%
NER                                2       5.71%
Change Detection                   1       2.86%
GPT-4                              1       2.86%
Image Generation                   1       2.86%
Text-to-Image Generation           1       2.86%
