Visual Entailment

27 papers with code • 3 benchmarks • 3 datasets

Visual Entailment (VE) is a task over image-sentence pairs in which the premise is an image rather than a natural-language sentence, as in traditional Textual Entailment. The goal is to predict whether the image semantically entails the text.
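
For concreteness, the standard SNLI-VE benchmark casts this as three-way classification over {entailment, neutral, contradiction}. A minimal PyTorch sketch of that framing, with the joint image-text encoder left as a placeholder:

```python
# Minimal sketch of the standard VE setup: a joint image-text encoder feeds a
# 3-way classifier over {entailment, neutral, contradiction}. The encoder is a
# placeholder; any vision-language encoder producing a pooled vector works.
import torch
import torch.nn as nn

LABELS = ["entailment", "neutral", "contradiction"]

class VEClassifier(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_dim: int = 768):
        super().__init__()
        self.encoder = encoder                     # assumed joint encoder
        self.head = nn.Linear(hidden_dim, len(LABELS))

    def forward(self, image: torch.Tensor, text_ids: torch.Tensor) -> torch.Tensor:
        # Premise = image, hypothesis = text; return logits over the 3 labels.
        joint = self.encoder(image, text_ids)      # (batch, hidden_dim)
        return self.head(joint)
```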

Libraries

Use these libraries to find Visual Entailment models and implementations

Prompt Tuning for Generative Multimodal Pretrained Models

ofa-sys/ofa 4 Aug 2022

Prompt tuning has become a new paradigm for model tuning and it has demonstrated success in natural language pretraining and even vision pretraining.

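As a rough illustration of the idea (not OFA's actual implementation): soft prompt tuning freezes the pretrained model and trains only a small bank of prompt embeddings prepended to the input.

```python
# Hedged sketch of soft prompt tuning: the backbone is frozen and only the
# learnable prompt embeddings, prepended to the input sequence, are trained.
# `backbone` and its embedding interface are placeholders, not OFA's API.
import torch
import torch.nn as nn

class PromptTunedModel(nn.Module):
    def __init__(self, backbone: nn.Module, embed_dim: int, prompt_len: int = 20):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False                # freeze the pretrained model
        self.prompt = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim) token embeddings
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return self.backbone(torch.cat([prompt, input_embeds], dim=1))
```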

Chunk-aware Alignment and Lexical Constraint for Visual Entailment with Natural Language Explanations

HITsz-TMG/ExplainableVisualEntailment 23 Jul 2022

The proposed method consists of a Chunk-aware Semantic Interactor (CSI), a relation inferrer, and a Lexical Constraint-aware Generator.

MixGen: A New Multi-Modal Data Augmentation

amazon-research/mix-generation 16 Jun 2022

Data augmentation is a necessity to enhance data efficiency in deep learning.

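That line is only the motivation; as a rough, simplified sketch of MixGen-style augmentation, a new training pair can be formed by linearly interpolating two images and concatenating their paired texts:

```python
# Rough sketch of MixGen-style multimodal augmentation (simplified reading):
# interpolate two images and concatenate their paired texts into a new pair.
import torch

def mixgen(image_a: torch.Tensor, image_b: torch.Tensor,
           text_a: str, text_b: str, lam: float = 0.5):
    mixed_image = lam * image_a + (1.0 - lam) * image_b
    mixed_text = text_a + " " + text_b             # simple concatenation
    return mixed_image, mixed_text
```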

CoCa: Contrastive Captioners are Image-Text Foundation Models

mlfoundations/open_clip 4 May 2022

We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the outputs of the multimodal decoder, which predicts text tokens autoregressively.

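A sketch of the two training signals described above, with illustrative shapes and loss weights rather than the paper's exact implementation:

```python
# Sketch of CoCa-style training: an InfoNCE-style contrastive loss on unimodal
# embeddings plus an autoregressive captioning (cross-entropy) loss on the
# multimodal decoder's token logits. Names and weights are illustrative.
import torch
import torch.nn.functional as F

def coca_loss(img_emb, txt_emb, caption_logits, caption_targets,
              temperature: float = 0.07, caption_weight: float = 2.0):
    # img_emb, txt_emb: (batch, dim), L2-normalized unimodal embeddings
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    contrastive = (F.cross_entropy(logits, targets) +
                   F.cross_entropy(logits.t(), targets)) / 2
    # caption_logits: (batch, seq_len, vocab); caption_targets: (batch, seq_len)
    captioning = F.cross_entropy(caption_logits.flatten(0, 1),
                                 caption_targets.flatten())
    return contrastive + caption_weight * captioning
```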

Visual Spatial Reasoning

cambridgeltl/visual-spatial-reasoning 30 Apr 2022

Spatial relations are a basic part of human cognition.

Fine-Grained Visual Entailment

skrighyz/fgve 29 Mar 2022

In this paper, we propose an extension of this task, where the goal is to predict the logical relationship of fine-grained knowledge elements within a piece of text to an image.

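A toy sketch of that fine-grained formulation, assuming the hypothesis has already been split into knowledge elements (the hard part, elided here) and reusing a hypothetical sentence-level VE scorer per element:

```python
# Toy sketch: fine-grained VE assigns a label to each knowledge element of the
# hypothesis rather than one label for the whole sentence. `ve_model` is a
# hypothetical sentence-level VE predictor applied per element.
def fine_grained_ve(image, knowledge_elements, ve_model):
    # knowledge_elements: list of short text spans extracted from the hypothesis
    return {elem: ve_model.predict(image, elem) for elem in knowledge_elements}
```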

NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks

fawazsammani/nlxgpt CVPR 2022

Current NLE models explain the decision-making process of a vision or vision-language model (a.k.a. the task model), e.g. a VQA model, via a language model (a.k.a. the explanation model), e.g. GPT.

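A sketch of the two-model pattern the snippet describes (a task model for the decision, a separate language model for the explanation); both models and the prompt format are placeholders:

```python
# Sketch of the two-stage NLE pattern: the task model predicts an answer, then
# an explanation model generates a natural-language justification for it.
def explain_prediction(image, question, task_model, explanation_model):
    answer = task_model.predict(image, question)            # the decision
    prompt = f"Question: {question} Answer: {answer} because"
    explanation = explanation_model.generate(image, prompt)  # the rationale
    return answer, explanation
```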

OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

modelscope/modelscope 7 Feb 2022

In this work, we pursue a unified paradigm for multimodal pretraining to break the scaffolds of complex task/modality-specific customization.

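A sketch of the unified sequence-to-sequence framing: each task is rendered as instruction text plus the image, and the model emits target text, so VE needs no task-specific head. The templates below are illustrative, not OFA's exact prompts:

```python
# Illustrative task templates for a unified seq2seq model: every task becomes
# "instruction + image -> target text". Template wording is an assumption.
TEMPLATES = {
    "visual_entailment": 'can image and text "{text}" imply text "{hypothesis}"?',
    "captioning": "what does the image describe?",
    "vqa": "{question}",
}

def build_input(task: str, **fields) -> str:
    return TEMPLATES[task].format(**fields)

# e.g. build_input("vqa", question="what color is the car?")
```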

Distilled Dual-Encoder Model for Vision-Language Understanding

kugwzk/distilled-dualencoder 16 Dec 2021

We propose a cross-modal attention distillation framework to train a dual-encoder model for vision-language understanding tasks, such as visual reasoning and visual question answering.

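A hedged sketch of the attention-distillation idea: push the student's attention distributions toward the teacher's (a fusion-encoder model) with a KL term added to the usual task loss. The attention-map extraction and shapes are assumptions:

```python
# Sketch of cross-modal attention distillation: KL divergence between the
# student's and teacher's attention maps over the same token layout.
import torch.nn.functional as F

def attention_distill_loss(student_attn, teacher_attn, eps: float = 1e-8):
    # student_attn, teacher_attn: (batch, heads, query_len, key_len),
    # already softmax-normalized attention distributions.
    return F.kl_div((student_attn + eps).log(), teacher_attn,
                    reduction="batchmean")
```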

Check It Again: Progressive Visual Question Answering via Visual Entailment

PhoebusSi/SAR ACL 2021

Moreover, existing VQA methods only explore the interaction between the image and the question, ignoring the semantics of candidate answers.

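A sketch of re-ranking candidate answers with a VE model, as the title suggests: rewrite each (question, answer) pair as a declarative hypothesis and score it against the image. Both helpers are hypothetical:

```python
# Sketch of VE-based answer re-ranking: each candidate answer is turned into a
# declarative hypothesis and scored by a (hypothetical) VE model; the highest-
# scoring candidate wins.
def rerank_answers(image, question, candidates, ve_model, to_hypothesis):
    scored = [(ans, ve_model.entailment_score(image, to_hypothesis(question, ans)))
              for ans in candidates]
    return max(scored, key=lambda pair: pair[1])[0]
```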