Visual Entailment
27 papers with code • 3 benchmarks • 3 datasets
Visual Entailment (VE) is a task over image-sentence pairs in which the premise is an image rather than a natural-language sentence, as in traditional Textual Entailment. The goal is to predict whether the image semantically entails the text.
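In the standard SNLI-VE formulation, this is a three-way decision: entailment, neutral, or contradiction. A minimal sketch of that decision, assuming some image-text similarity score has already been computed by a vision-language model (the thresholds and the function name here are illustrative, not taken from any of the papers below):

```python
# Hedged sketch: VE reduces an image-text pair to a 3-way label.
# The similarity score stands in for the output of a real vision-language
# model (e.g. a CLIP-style encoder); the thresholds are illustrative only.

def classify_entailment(similarity: float,
                        entail_threshold: float = 0.6,
                        contra_threshold: float = 0.3) -> str:
    """Map an image-text similarity score in [0, 1] to a VE label."""
    if similarity >= entail_threshold:
        return "entailment"      # image strongly supports the hypothesis
    if similarity <= contra_threshold:
        return "contradiction"   # image conflicts with the hypothesis
    return "neutral"             # image neither confirms nor refutes it
```

In practice the models below learn this decision end-to-end rather than via fixed thresholds, typically with a classification head over a joint image-text representation.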
Latest papers with no code
A survey on knowledge-enhanced multimodal learning
Multimodal learning has been a field of increasing interest, aiming to combine various modalities in a single joint representation.
AlignVE: Visual Entailment Recognition Based on Alignment Relations
Such approaches, however, ignore VE's unique nature of relation inference between the premise and the hypothesis.
Pre-training image-language transformers for open-vocabulary tasks
We present a pre-training approach for vision and language transformer models, which is based on a mixture of diverse tasks.
Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering
We present Answer-Me, a task-aware multi-task framework which unifies a variety of question answering tasks, such as visual question answering, visual entailment, and visual reasoning.
Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks
Experiments demonstrate that MAD leads to consistent gains in the low-shot, domain-shifted, and fully-supervised conditions on VCR, SNLI-VE, and VQA, achieving SOTA performance on VCR compared to other single models pretrained with image-text data.
CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment
We first evaluate CLIP's zero-shot performance on a typical visual question answering task and demonstrate a zero-shot cross-modality transfer capability of CLIP on the visual entailment task.
Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment
We adapt our pre-trained model to a set of V+L downstream tasks, including VQA, NLVR2, Visual Entailment, and RefCOCO+.
CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks
Experiments demonstrate that our proposed CLIP-TD leads to exceptional gains in the low-shot (up to 51.9%) and domain-shifted (up to 71.3%) conditions of VCR, while simultaneously improving performance under standard fully-supervised conditions (up to 2%), achieving state-of-the-art performance on VCR compared to other single models that are pretrained with image-text data only.
Logically at Factify 2022: Multimodal Fact Verification
This paper describes our participant system for the multi-modal fact verification (Factify) challenge at AAAI 2022.
Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation
Our experiments show that there is a trade-off between understanding tasks and generation tasks while using the same model, and a feasible way to improve both tasks is to use more data.