Visual Entailment
27 papers with code • 3 benchmarks • 3 datasets
Visual Entailment (VE) is a task over image-sentence pairs in which the premise is an image rather than a natural language sentence, as in traditional Textual Entailment tasks. The goal is to predict whether the image semantically entails the text.
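The sketch below illustrates one common way to set this up, assuming CLIP as a frozen image/text encoder and an untrained, illustrative three-way head over entailment / neutral / contradiction (the label set used by SNLI-VE). The model name, head design, and file path are assumptions, not a reference implementation.

```python
# A minimal Visual Entailment sketch: CLIP embeddings for image (premise) and
# text (hypothesis), fed to a 3-way classifier head that would be trained on
# a VE dataset such as SNLI-VE. Everything task-specific here is illustrative.
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

LABELS = ["entailment", "neutral", "contradiction"]

class VEClassifier(nn.Module):
    def __init__(self, clip_name="openai/clip-vit-base-patch32"):
        super().__init__()
        self.clip = CLIPModel.from_pretrained(clip_name)
        dim = self.clip.config.projection_dim  # shared embedding size
        # Head over [image; text; |image - text|]; untrained in this sketch.
        self.head = nn.Linear(3 * dim, len(LABELS))

    def forward(self, pixel_values, input_ids, attention_mask):
        img = self.clip.get_image_features(pixel_values=pixel_values)
        txt = self.clip.get_text_features(input_ids=input_ids,
                                          attention_mask=attention_mask)
        feats = torch.cat([img, txt, (img - txt).abs()], dim=-1)
        return self.head(feats)

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = VEClassifier()
inputs = processor(text=["two dogs are playing in the snow"],
                   images=Image.open("example.jpg"),  # placeholder image path
                   return_tensors="pt", padding=True)
logits = model(inputs["pixel_values"], inputs["input_ids"],
               inputs["attention_mask"])
print(LABELS[logits.argmax(-1).item()])  # arbitrary until the head is trained
```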
Libraries
Use these libraries to find Visual Entailment models and implementations

Latest papers
Prompt Tuning for Generative Multimodal Pretrained Models
Prompt tuning has become a new paradigm for model tuning and it has demonstrated success in natural language pretraining and even vision pretraining.
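A minimal sketch of the soft-prompt-tuning idea follows, using a frozen GPT-2 as a stand-in for a generative multimodal pretrained model: the pretrained weights stay fixed and only the prepended prompt embeddings receive gradients. The prompt length, initialization, and choice of GPT-2 are illustrative assumptions.

```python
# Soft prompt tuning sketch: freeze the pretrained model, learn a small set
# of continuous prompt vectors prepended to the input embeddings.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
for p in model.parameters():          # freeze all pretrained weights
    p.requires_grad = False

n_prompt = 20                         # assumed prompt length
soft_prompt = nn.Parameter(torch.randn(n_prompt, model.config.n_embd) * 0.02)

def prompt_tuning_loss(text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    tok_emb = model.transformer.wte(ids)                       # token embeddings
    inputs = torch.cat([soft_prompt.unsqueeze(0), tok_emb], dim=1)
    # Prompt positions get label -100 so the LM loss ignores them.
    labels = torch.cat([torch.full((1, n_prompt), -100), ids], dim=1)
    return model(inputs_embeds=inputs, labels=labels).loss

optimizer = torch.optim.Adam([soft_prompt], lr=1e-3)  # tunes only the prompt
loss = prompt_tuning_loss("an example training sentence")
loss.backward()
optimizer.step()
```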
Chunk-aware Alignment and Lexical Constraint for Visual Entailment with Natural Language Explanations
It contains a Chunk-aware Semantic Interactor (arr. CSI), a relation inferrer, and a Lexical Constraint-aware Generator (arr. LeCG).
MixGen: A New Multi-Modal Data Augmentation
Data augmentation is a necessity to enhance data efficiency in deep learning.
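The sketch below shows the MixGen-style augmentation described in the paper: interpolate two images pixel-wise and concatenate their paired texts to form a new image-text pair. The tensor shapes and the mixing coefficient of 0.5 follow the paper's description but are assumptions here.

```python
# MixGen-style multimodal augmentation sketch: mix images linearly,
# concatenate the corresponding captions.
import torch

def mixgen(images, texts, lam=0.5):
    """images: (B, C, H, W) float tensor; texts: list of B strings."""
    perm = torch.randperm(images.size(0))
    mixed_images = lam * images + (1 - lam) * images[perm]    # pixel-level mixup
    mixed_texts = [t + " " + texts[j] for t, j in zip(texts, perm.tolist())]
    return mixed_images, mixed_texts

imgs = torch.rand(4, 3, 224, 224)
caps = ["a dog in the snow", "a red car", "two children", "a bowl of fruit"]
aug_imgs, aug_caps = mixgen(imgs, caps)
```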
CoCa: Contrastive Captioners are Image-Text Foundation Models
We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the multimodal decoder outputs which predicts text tokens autoregressively.
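A minimal sketch of these two objectives, assuming precomputed unimodal image/text embeddings and multimodal decoder logits; the temperature, tensor shapes, and equal loss weighting are illustrative, not CoCa's exact configuration.

```python
# CoCa-style training objective sketch: InfoNCE contrastive loss on unimodal
# embeddings plus an autoregressive captioning (cross-entropy) loss.
import torch
import torch.nn.functional as F

def coca_loss(img_emb, txt_emb, decoder_logits, caption_ids, temperature=0.07):
    # Contrastive loss over L2-normalized unimodal embeddings, shape (B, D).
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    sims = img_emb @ txt_emb.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0))
    contrastive = (F.cross_entropy(sims, targets) +
                   F.cross_entropy(sims.t(), targets)) / 2
    # Captioning loss: predict token t+1 from tokens <= t (teacher forcing).
    captioning = F.cross_entropy(
        decoder_logits[:, :-1].reshape(-1, decoder_logits.size(-1)),
        caption_ids[:, 1:].reshape(-1))
    return contrastive + captioning   # equal weighting assumed in this sketch

B, D, T, V = 8, 512, 16, 1000         # illustrative sizes
loss = coca_loss(torch.randn(B, D), torch.randn(B, D),
                 torch.randn(B, T, V), torch.randint(0, V, (B, T)))
```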
Visual Spatial Reasoning
Spatial relations are a basic part of human cognition.
Fine-Grained Visual Entailment
In this paper, we propose an extension of this task, where the goal is to predict the logical relationship of fine-grained knowledge elements within a piece of text to an image.
NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks
Current NLE models explain the decision-making process of a vision or vision-language model (a.k.a. the task model), e.g., a VQA model, via a language model (a.k.a. the explanation model), e.g., GPT.
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
In this work, we pursue a unified paradigm for multimodal pretraining to break the scaffolds of complex task/modality-specific customization.
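The sketch below illustrates the unified idea: every task, including Visual Entailment, is phrased as a textual instruction and answered as plain text by one sequence-to-sequence model, so no task-specific heads are needed. The instruction templates are illustrative, not OFA's exact prompts.

```python
# Unified seq2seq task-formatting sketch: tasks become instruction strings,
# predictions become decoded text. Templates here are assumptions.
INSTRUCTIONS = {
    "caption":    "what does the image describe?",
    "vqa":        "{question}",
    "entailment": 'does the image describe "{text}"?',  # yes / no / maybe
}

def build_input(task, **fields):
    return INSTRUCTIONS[task].format(**fields)

src = build_input("entailment", text="two dogs play in the snow")
# The (image, src) pair would be fed to a single encoder-decoder model; the
# decoded string is the prediction, with no task-specific classifier.
```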
Distilled Dual-Encoder Model for Vision-Language Understanding
We propose a cross-modal attention distillation framework to train a dual-encoder model for vision-language understanding tasks, such as visual reasoning and visual question answering.
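A minimal sketch of a cross-modal attention distillation term, assuming access to attention probability maps from a fusion-encoder teacher and a dual-encoder student over the same image-text pair; the shapes, names, and KL formulation are assumptions for illustration.

```python
# Attention distillation sketch: train the student to match the teacher's
# cross-modal attention distributions via KL divergence.
import torch
import torch.nn.functional as F

def attention_distill_loss(student_attn, teacher_attn, eps=1e-8):
    """Both inputs: (B, heads, Q, K) attention probabilities over tokens."""
    return F.kl_div((student_attn + eps).log(), teacher_attn,
                    reduction="batchmean")

B, H, Q, K = 2, 8, 16, 16             # illustrative sizes
teacher = torch.softmax(torch.randn(B, H, Q, K), dim=-1)
student = torch.softmax(torch.randn(B, H, Q, K), dim=-1)
loss = attention_distill_loss(student, teacher)
# Minimized during training alongside the usual task loss.
```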
Check It Again: Progressive Visual Question Answering via Visual Entailment
Moreover, existing VQA models only explore the interaction between the image and the question, ignoring the semantics of candidate answers.