Visual Entailment
27 papers with code • 3 benchmarks • 3 datasets
Visual Entailment (VE) is a task over image-sentence pairs in which the premise is an image rather than a natural-language sentence, as in traditional Textual Entailment. The goal is to predict whether the image semantically entails the text.
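In the standard SNLI-VE formulation, this is a three-way decision: entailment, neutral, or contradiction. A minimal sketch of that decision, assuming some image-text similarity score has already been computed by a vision-language model (the thresholds and the function name here are illustrative, not taken from any of the papers below):

```python
# Hedged sketch: VE reduces an image-text pair to a 3-way label.
# The similarity score stands in for the output of a real vision-language
# model (e.g. a CLIP-style encoder); the thresholds are illustrative only.

def classify_entailment(similarity: float,
                        entail_threshold: float = 0.6,
                        contra_threshold: float = 0.3) -> str:
    """Map an image-text similarity score in [0, 1] to a VE label."""
    if similarity >= entail_threshold:
        return "entailment"      # image strongly supports the hypothesis
    if similarity <= contra_threshold:
        return "contradiction"   # image conflicts with the hypothesis
    return "neutral"             # image neither confirms nor refutes it
```

In practice the models below learn this decision end-to-end rather than via fixed thresholds, typically with a classification head over a joint image-text representation.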
Latest papers with no code
A survey on knowledge-enhanced multimodal learning
Multimodal learning has been a field of increasing interest, aiming to combine various modalities in a single joint representation.
AlignVE: Visual Entailment Recognition Based on Alignment Relations
Such approaches, however, ignore VE's unique nature of relation inference between the premise and the hypothesis.
Pre-training image-language transformers for open-vocabulary tasks
We present a pre-training approach for vision and language transformer models, which is based on a mixture of diverse tasks.
Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering
We present Answer-Me, a task-aware multi-task framework which unifies a variety of question answering tasks, such as visual question answering, visual entailment, and visual reasoning.
Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks
Experiments demonstrate that MAD leads to consistent gains in the low-shot, domain-shifted, and fully-supervised conditions on VCR, SNLI-VE, and VQA, achieving SOTA performance on VCR compared to other single models pretrained with image-text data.
CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment
We first evaluate CLIP's zero-shot performance on a typical visual question answering task and demonstrate a zero-shot cross-modality transfer capability of CLIP on the visual entailment task.
Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment
We adapt our pre-trained model to a set of V+L downstream tasks, including VQA, NLVR2, Visual Entailment, and RefCOCO+.
CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks
Experiments demonstrate that our proposed CLIP-TD leads to exceptional gains in the low-shot (up to 51.9%) and domain-shifted (up to 71.3%) conditions of VCR, while simultaneously improving performance under standard fully-supervised conditions (up to 2%), achieving state-of-the-art performance on VCR compared to other single models that are pretrained with image-text data only.
Logically at Factify 2022: Multimodal Fact Verification
This paper describes our participant system for the multi-modal fact verification (Factify) challenge at AAAI 2022.
Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation
Our experiments show that there is a trade-off between understanding tasks and generation tasks while using the same model, and a feasible way to improve both tasks is to use more data.