Visual Entailment

27 papers with code • 3 benchmarks • 3 datasets

Visual Entailment (VE) is a task consisting of image-sentence pairs in which the premise is defined by an image, rather than a natural language sentence as in traditional Textual Entailment tasks. The goal is to predict whether the image semantically entails the text.
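A minimal sketch of the task format is shown below. The three-way labels follow the SNLI-VE convention; the record class, field names, and example values are purely illustrative and not tied to any specific codebase.

```python
# Illustrative sketch of a Visual Entailment example (not an official data schema).
from dataclasses import dataclass

LABELS = ("entailment", "neutral", "contradiction")  # SNLI-VE-style label set

@dataclass
class VEExample:
    image_path: str   # the premise is an image, not a sentence
    hypothesis: str   # natural-language hypothesis to check against the image
    label: str        # one of LABELS

# Hypothetical example record; the image path and hypothesis are placeholders.
example = VEExample(
    image_path="images/premise_0001.jpg",
    hypothesis="Two women are holding packages.",
    label="entailment",
)
assert example.label in LABELS
```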

Latest papers with no code

A survey on knowledge-enhanced multimodal learning

no code yet • 19 Nov 2022

Multimodal learning has been a field of increasing interest, aiming to combine various modalities in a single joint representation.

AlignVE: Visual Entailment Recognition Based on Alignment Relations

no code yet • 16 Nov 2022

Such approaches, however, ignore VE's unique nature of relation inference between the premise and the hypothesis.

Pre-training image-language transformers for open-vocabulary tasks

no code yet • 9 Sep 2022

We present a pre-training approach for vision and language transformer models, which is based on a mixture of diverse tasks.

Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering

no code yet • 2 May 2022

We present Answer-Me, a task-aware multi-task framework which unifies a variety of question answering tasks, such as visual question answering, visual entailment, and visual reasoning.

Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks

no code yet • 22 Apr 2022

Experiments demonstrate that MAD leads to consistent gains in the low-shot, domain-shifted, and fully-supervised conditions on VCR, SNLI-VE, and VQA, achieving SOTA performance on VCR compared to other single models pretrained with image-text data.

CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment

no code yet • ACL 2022

We first evaluate CLIP's zero-shot performance on a typical visual question answering task and demonstrate a zero-shot cross-modality transfer capability of CLIP on the visual entailment task.
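As an illustration of the kind of zero-shot cross-modality transfer this line of work studies, the sketch below scores a premise image against candidate hypothesis texts with an off-the-shelf CLIP checkpoint through the Hugging Face transformers API. It is not the paper's method; the checkpoint name, image path, and hypotheses are placeholders.

```python
# Hypothetical zero-shot sketch: rank hypotheses by CLIP image-text similarity.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("premise.jpg")          # premise image (placeholder path)
hypotheses = [
    "Two dogs are playing in the snow.",   # illustrative hypothesis texts
    "A cat is sleeping on a sofa.",
]

inputs = processor(text=hypotheses, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the similarity of the image to each hypothesis;
# a softmax turns the scores into a rough compatibility distribution.
probs = outputs.logits_per_image.softmax(dim=-1)
for text, p in zip(hypotheses, probs[0].tolist()):
    print(f"{p:.3f}  {text}")
```

Note that raw similarity scores only indicate which hypothesis is most compatible with the image; mapping them to the three VE labels would require an additional decision rule or calibration step.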

Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment

no code yet • CVPR 2022

We adapt our pre-trained model to a set of V+L downstream tasks, including VQA, NLVR2, Visual Entailment, and RefCOCO+.

CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks

no code yet • 15 Jan 2022

Experiments demonstrate that our proposed CLIP-TD leads to exceptional gains in the low-shot (up to 51.9%) and domain-shifted (up to 71.3%) conditions of VCR, while simultaneously improving performance under standard fully-supervised conditions (up to 2%), achieving state-of-the-art performance on VCR compared to other single models that are pretrained with image-text data only.

Logically at Factify 2022: Multimodal Fact Verification

no code yet • 16 Dec 2021

This paper describes our participant system for the multi-modal fact verification (Factify) challenge at AAAI 2022.

Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation

no code yet • 10 Dec 2021

Our experiments show that there is a trade-off between understanding tasks and generation tasks while using the same model, and a feasible way to improve both tasks is to use more data.