Visual Entailment

27 papers with code • 3 benchmarks • 3 datasets

Visual Entailment (VE) is a task consisting of image-sentence pairs whereby the premise is defined by an image, rather than a natural language sentence as in traditional Textual Entailment tasks. The goal is to predict whether the image semantically entails the text.
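
As a rough illustration of the pair structure, an image premise can be scored against candidate hypothesis sentences with an off-the-shelf image-text model such as CLIP. This is only a similarity proxy, not a trained three-way entailment/neutral/contradiction classifier, and the image path and hypothesis sentences below are placeholders.

```python
# Minimal sketch: score an image premise against text hypotheses with CLIP.
# A real VE model (e.g. trained on SNLI-VE) predicts entailment / neutral /
# contradiction; here we only rank hypotheses by image-text similarity.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("premise.jpg")  # placeholder path to the image premise
hypotheses = [                     # placeholder hypothesis sentences
    "Two dogs are playing in the snow.",
    "A cat is sleeping on a couch.",
]

inputs = processor(text=hypotheses, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape: (1, num_hypotheses)
probs = logits.softmax(dim=-1)
for hyp, p in zip(hypotheses, probs[0].tolist()):
    print(f"{p:.2f}  {hyp}")
```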

Libraries

Use these libraries to find Visual Entailment models and implementations

Latest papers with no code

VEglue: Testing Visual Entailment Systems via Object-Aligned Joint Erasing

no code yet • 5 Mar 2024

Visual entailment (VE) is a multimodal reasoning task consisting of image-sentence pairs whereby a premise is defined by an image, and a hypothesis is described by a sentence.

ArcSin: Adaptive ranged cosine Similarity injected noise for Language-Driven Visual Tasks

no code yet • 27 Feb 2024

We train models for these tasks in a zero-shot cross-modal transfer setting, a domain where the previous state-of-the-art method relied on fixed-scale noise injection, often compromising the semantic content of the original modality embedding.
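
For context, the fixed-scale noise injection the snippet refers to can be sketched as adding isotropic Gaussian noise of a constant magnitude to a modality embedding during training. The scale value below is an arbitrary placeholder, and this is not the ArcSin formulation itself.

```python
# Sketch of fixed-scale noise injection on a modality embedding (the prior
# approach the snippet mentions); sigma is an arbitrary placeholder value.
import torch

def inject_fixed_noise(embedding: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    """Add isotropic Gaussian noise of constant scale to an embedding."""
    return embedding + sigma * torch.randn_like(embedding)

text_emb = torch.randn(1, 512)           # stand-in for a text encoder output
noisy_emb = inject_fixed_noise(text_emb)  # used in place of the clean embedding
```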

LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition

no code yet • 15 Feb 2024

Grounded Multimodal Named Entity Recognition (GMNER) is a nascent multimodal task that aims to identify named entities, entity types and their corresponding visual regions.
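
To make the task output concrete, a GMNER prediction couples each entity span with a type label and, when the entity is visible, a grounded image region. The field names and example values below are illustrative only.

```python
# Illustrative GMNER prediction record: entity span, entity type, and the
# grounded visual region (bounding box) if the entity appears in the image.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class GMNERPrediction:
    entity: str                                  # named entity text span
    entity_type: str                             # e.g. PER, LOC, ORG, MISC
    region: Optional[Tuple[int, int, int, int]]  # (x1, y1, x2, y2) or None

preds = [
    GMNERPrediction("Lionel Messi", "PER", (34, 20, 210, 415)),
    GMNERPrediction("Paris", "LOC", None),  # mentioned but not visible
]
```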

Lightweight In-Context Tuning for Multimodal Unified Models

no code yet • 8 Oct 2023

In-context learning (ICL) involves reasoning from given contextual examples.

"Let's not Quote out of Context": Unified Vision-Language Pretraining for Context Assisted Image Captioning

no code yet • 1 Jun 2023

We exploit context by pretraining our model with datasets of three tasks: news image captioning where the news article is the context, contextual visual entailment, and keyword extraction from the context.

Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning

no code yet • CVPR 2023

Hence we advocate that the key to better performance lies in meaningful latent modality structures instead of perfect modality alignment.

Few-shot Multimodal Multitask Multilingual Learning

no code yet • 19 Feb 2023

While few-shot learning as a transfer learning paradigm has gained significant traction for scenarios with limited data, it has primarily been explored in the context of building unimodal and unilingual models.

Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift

no code yet • 15 Dec 2022

Multimodal image-text models have shown remarkable performance in the past few years.

Compound Tokens: Channel Fusion for Vision-Language Representation Learning

no code yet • 2 Dec 2022

We concatenate all the compound tokens for further processing with a multimodal encoder, as sketched below.
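
As a hedged sketch of what channel-level fusion could look like (not necessarily the paper's exact construction), aligned vision and text tokens can be concatenated along the feature dimension to form compound tokens, which are then concatenated into a single sequence for the multimodal encoder. All shapes and the use of average-pooled cross-modal context are assumptions.

```python
# Sketch: fuse vision and text tokens along the channel (feature) dimension,
# then concatenate the resulting compound tokens into one sequence.
import torch

B, Nv, Nt, D = 2, 16, 8, 256                  # assumed batch/token/feature sizes
vision_tokens = torch.randn(B, Nv, D)
text_tokens = torch.randn(B, Nt, D)

# Pair each token with a pooled summary of the other modality, then fuse
# along the channel axis (D -> 2D) to form compound tokens.
text_ctx = text_tokens.mean(dim=1, keepdim=True).expand(-1, Nv, -1)
vision_ctx = vision_tokens.mean(dim=1, keepdim=True).expand(-1, Nt, -1)
vision_compound = torch.cat([vision_tokens, text_ctx], dim=-1)   # (B, Nv, 2D)
text_compound = torch.cat([text_tokens, vision_ctx], dim=-1)     # (B, Nt, 2D)

# Concatenate all compound tokens for further processing by a multimodal encoder.
compound_tokens = torch.cat([vision_compound, text_compound], dim=1)  # (B, Nv+Nt, 2D)
```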

A survey on knowledge-enhanced multimodal learning

no code yet • 19 Nov 2022

Multimodal learning has been a field of increasing interest, aiming to combine various modalities in a single joint representation.