Image-text matching

84 papers with code • 1 benchmark • 1 dataset



Most implemented papers

Visual Semantic Reasoning for Image-Text Matching

KunpengLi1994/VSRN ICCV 2019

It outperforms the current best method by 6.8% relatively for image retrieval and 4.8% relatively for caption retrieval on MS-COCO (Recall@1 using the 1K test set).
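Recall@1, the metric quoted above, is the fraction of queries whose ground-truth match is ranked first by the model's similarity scores. A minimal sketch of Recall@K (function and variable names are illustrative, not taken from the VSRN code):

```python
def recall_at_k(sim, gt, k=1):
    """Fraction of queries whose ground-truth item appears in the
    top-k gallery items ranked by similarity (higher = better match).

    sim: list of rows, one per query, of similarity scores to the gallery.
    gt:  ground-truth gallery index for each query.
    """
    hits = 0
    for row, target in zip(sim, gt):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(sim)

# Toy 3-query example: query i should match gallery item i.
sim = [
    [0.9, 0.2, 0.1],  # correct item ranked first -> hit
    [0.3, 0.1, 0.8],  # correct item ranked last  -> miss
    [0.1, 0.2, 0.7],  # correct item ranked first -> hit
]
print(recall_at_k(sim, gt=[0, 1, 2], k=1))  # 2 of 3 queries hit at rank 1
```

The benchmark numbers average this over both directions: caption retrieval (image queries, text gallery) and image retrieval (text queries, image gallery).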

ECCV Caption: Correcting False Negatives by Collecting Machine-and-Human-verified Image-Caption Associations for MS-COCO

naver-ai/eccv-caption 7 Apr 2022

Image-Text matching (ITM) is a common task for evaluating the quality of Vision and Language (VL) models.

Dissecting Deep Metric Learning Losses for Image-Text Retrieval

littleredxh/vse-gradient 21 Oct 2022

When the gradients are not integrable to a valid loss function, we implement our proposed objectives so that they operate directly in the gradient space rather than on losses in the embedding space.
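The losses being dissected here build on the hinge-based triplet objective that is standard in image-text retrieval (VSE-style models): a negative pair is penalized whenever it scores within a margin of the positive pair. A minimal sketch, with the margin value chosen arbitrarily for illustration:

```python
def triplet_hinge_loss(s_pos, s_neg, margin=0.2):
    """Hinge-based triplet loss common in image-text retrieval:
    zero when the positive pair outscores the negative by at least
    `margin`, linear in the violation otherwise."""
    return max(0.0, margin - s_pos + s_neg)

# Negative pair scores far below the positive: no violation, zero loss.
easy = triplet_hinge_loss(s_pos=0.9, s_neg=0.3)
# Negative outscores the positive: loss grows with the violation.
hard = triplet_hinge_loss(s_pos=0.5, s_neg=0.6)
```

The gradient of this loss with respect to the scores is piecewise constant, which is what makes it natural to analyze (and modify) the objective directly in gradient space, as the paper proposes.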

Self-supervised vision-language pretraining for Medical visual question answering

pengfeiliheu/m2i2 24 Nov 2022

Medical visual question answering (VQA) is the task of answering clinical questions about a given radiographic image, a challenging problem that requires a model to integrate both vision and language information.

A Differentiable Semantic Metric Approximation in Probabilistic Embedding for Cross-Modal Retrieval

VL-Group/2022-NeurIPS-DAA NeurIPS 2022

To verify the effectiveness of our approach, extensive experiments are conducted on MS-COCO, CUB Captions, and Flickr30K, which are commonly used in cross-modal retrieval.

Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations

zjukg/structure-clip 6 May 2023

In this paper, we present an end-to-end framework Structure-CLIP, which integrates Scene Graph Knowledge (SGK) to enhance multi-modal structured representations.

Learning Two-Branch Neural Networks for Image-Text Matching Tasks

BryanPlummer/cite 11 Apr 2017

Image-language matching tasks have recently attracted a lot of attention in the computer vision field.

Deep Cross-Modal Projection Learning for Image-Text Matching

YingZhangDUT/Cross-Modal-Projection-Learning ECCV 2018

The key point of image-text matching is how to accurately measure the similarity between visual and textual inputs.
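In the simplest joint-embedding formulation, that similarity is the cosine between an image embedding and a text embedding projected into a shared space. A minimal sketch (the embeddings shown are made up for illustration; the paper's projection learning is not reproduced here):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors: the dot
    product of the vectors divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

img_emb = [0.2, 0.8, 0.1]   # hypothetical image embedding
txt_emb = [0.25, 0.7, 0.0]  # hypothetical caption embedding
score = cosine_similarity(img_emb, txt_emb)  # close to 1.0 for a good match
```

Matched image-text pairs are trained to score near 1, mismatched pairs lower; retrieval then reduces to ranking gallery items by this score.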

Position Focused Attention Network for Image-Text Matching

HaoYang0123/Position-Focused-Attention-Network 23 Jul 2019

Then, an attention mechanism is proposed to model the relations between image regions and position blocks and to generate a valuable position feature, which is further used to enhance the region representation and model a more reliable relationship between the visual image and the textual sentence.

Learning fragment self-attention embeddings for image-text matching

yiling2018/saem ACMMM 2019

In this paper, we propose Self-Attention Embeddings (SAEM) to exploit fragment relations in images or texts by self-attention mechanism, and aggregate fragment information into visual and textual embeddings.
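The core operation in such fragment-level models is scaled dot-product self-attention, where each fragment embedding is re-expressed as a weighted average of all fragments. A minimal single-head sketch using plain lists (not the SAEM implementation, which adds learned projections and multiple heads):

```python
import math

def self_attention(x):
    """Minimal single-head self-attention over fragment embeddings,
    with queries = keys = values = the fragments themselves.
    x: list of d-dimensional fragment vectors."""
    d = len(x[0])
    out = []
    for q in x:
        # Scaled dot-product scores of this fragment against all fragments.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        # Softmax (shifted by the max for numerical stability).
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Attended output: weighted average of the fragment vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, x))
                    for j in range(d)])
    return out
```

Aggregating the attended fragment vectors (e.g. by mean pooling) then yields the single visual or textual embedding that the matching loss operates on.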