Image-text matching
84 papers with code • 1 benchmark • 1 dataset
Libraries
Use these libraries to find Image-text matching models and implementations.
Most implemented papers
Visual Semantic Reasoning for Image-Text Matching
It outperforms the current best method by 6.8% relatively for image retrieval and 4.8% relatively for caption retrieval on MS-COCO (Recall@1 using the 1K test set).
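The Recall@K metric cited above is the standard retrieval measure on MS-COCO. A minimal sketch of how it is typically computed, assuming a score matrix whose ground-truth match for query i sits at gallery index i (this is an illustration, not any specific paper's evaluation code):

```python
import numpy as np

def recall_at_k(scores: np.ndarray, k: int = 1) -> float:
    """Fraction of queries whose ground-truth item appears in the top-k results.

    scores: (n_queries, n_gallery) similarity matrix; the correct match
    for query i is assumed to be gallery item i.
    """
    # Rank gallery items for each query, highest score first.
    ranking = np.argsort(-scores, axis=1)
    targets = np.arange(scores.shape[0])
    # A hit if the target index appears among the first k ranked columns.
    hits = (ranking[:, :k] == targets[:, None]).any(axis=1)
    return float(hits.mean())
```

With a 2x2 score matrix where both diagonal scores dominate, `recall_at_k(scores, 1)` returns 1.0; reported numbers like "Recall@1 on the 1K test set" are this quantity averaged over the test queries.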
ECCV Caption: Correcting False Negatives by Collecting Machine-and-Human-verified Image-Caption Associations for MS-COCO
Image-Text matching (ITM) is a common task for evaluating the quality of Vision and Language (VL) models.
Dissecting Deep Metric Learning Losses for Image-Text Retrieval
When the gradients are not integrable into a valid loss function, we implement our proposed objectives so that they operate directly in the gradient space, rather than on the losses in the embedding space.
Self-supervised vision-language pretraining for Medical visual question answering
Medical visual question answering (VQA) is the task of answering clinical questions given a radiographic image; it is a challenging problem that requires a model to integrate both vision and language information.
A Differentiable Semantic Metric Approximation in Probabilistic Embedding for Cross-Modal Retrieval
To verify the effectiveness of our approach, extensive experiments are conducted on MS-COCO, CUB Captions, and Flickr30K, which are commonly used in cross-modal retrieval.
Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations
In this paper, we present an end-to-end framework Structure-CLIP, which integrates Scene Graph Knowledge (SGK) to enhance multi-modal structured representations.
Learning Two-Branch Neural Networks for Image-Text Matching Tasks
Image-language matching tasks have recently attracted a lot of attention in the computer vision field.
Deep Cross-Modal Projection Learning for Image-Text Matching
The key point of image-text matching is how to accurately measure the similarity between visual and textual inputs.
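A common baseline for measuring that similarity is cosine similarity between image and text embeddings projected into a shared space. A minimal sketch, with the encoders that produce the embeddings assumed and not shown:

```python
import numpy as np

def cosine_scores(img_emb: np.ndarray, txt_emb: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between image and text embeddings.

    img_emb: (n_images, d), txt_emb: (n_texts, d) -- both assumed to be
    outputs of encoders mapping into the same d-dimensional space.
    """
    # L2-normalize each row so the dot product equals cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    return img @ txt.T  # (n_images, n_texts) score matrix
```

The resulting score matrix is what retrieval metrics such as Recall@K are computed over; methods in this list differ mainly in how the embeddings are produced and which loss shapes this score.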
Position Focused Attention Network for Image-Text Matching
An attention mechanism is then proposed to model the relations between each image region and its position blocks, generating a position feature that is further used to enhance the region representation and to model a more reliable relationship between the visual image and the textual sentence.
Learning fragment self-attention embeddings for image-text matching
In this paper, we propose Self-Attention Embeddings (SAEM) to exploit fragment relations in images or texts by self-attention mechanism, and aggregate fragment information into visual and textual embeddings.
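Fragment self-attention of this kind can be sketched as scaled dot-product attention over a set of fragment embeddings (image regions or words), followed by pooling into a single embedding. Projection matrices are omitted for brevity; this is an illustration of the general mechanism, not the paper's exact SAEM formulation:

```python
import numpy as np

def self_attend(fragments: np.ndarray) -> np.ndarray:
    """Contextualize fragment embeddings via self-attention, then mean-pool.

    fragments: (n_fragments, d) region or word vectors.
    Returns a single (d,) aggregated embedding.
    """
    d = fragments.shape[1]
    # Pairwise relation scores between fragments (queries = keys = inputs).
    logits = fragments @ fragments.T / np.sqrt(d)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    attended = weights @ fragments  # each fragment mixed with its relations
    return attended.mean(axis=0)    # aggregate into one embedding
```

The same routine can be applied independently to the image side and the text side, with the two pooled vectors then compared by a similarity score such as cosine similarity.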