Image Captioning

600 papers with code • 31 benchmarks • 64 datasets

Image Captioning is the task of describing the content of an image in words. It lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework, in which an input image is encoded into an intermediate representation of its visual content and then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with metrics such as BLEU or CIDEr.

(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)
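Since the encoder-decoder framework is the common thread across these systems, a minimal sketch may help. The PyTorch example below is illustrative only, not any particular paper's model: a linear projection of precomputed CNN features acts as the encoder, and an LSTM language model acts as the decoder; all dimensions and names are assumptions.

```python
# Minimal encoder-decoder captioning sketch (illustrative; not a specific paper's model).
import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    def __init__(self, vocab_size: int, feat_dim: int = 2048,
                 embed_dim: int = 512, hidden_dim: int = 512):
        super().__init__()
        # Encoder: project precomputed image features (e.g. from a CNN) into the decoder space.
        self.encoder = nn.Linear(feat_dim, embed_dim)
        # Decoder: an LSTM language model conditioned on the image representation.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # feats: (B, feat_dim) image features; tokens: (B, T) caption token ids.
        img = self.encoder(feats).unsqueeze(1)   # (B, 1, embed_dim)
        emb = self.embed(tokens)                 # (B, T, embed_dim)
        seq = torch.cat([img, emb], dim=1)       # image acts as the first "token"
        out, _ = self.lstm(seq)
        # Output at step t is conditioned on the image and tokens[:t], so
        # dropping the last step yields one next-token prediction per position.
        return self.head(out[:, :-1])            # (B, T, vocab_size)

model = CaptionModel(vocab_size=10000)
logits = model(torch.randn(4, 2048), torch.randint(0, 10000, (4, 12)))
print(logits.shape)  # torch.Size([4, 12, 10000])
```

Training minimizes cross-entropy between these logits and the reference caption tokens; at evaluation time, generated captions are scored against human references with metrics such as BLEU or CIDEr.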

Libraries

Use these libraries to find Image Captioning models and implementations

Latest papers with no code

The Solution for the ICCV 2023 1st Scientific Figure Captioning Challenge

no code yet • 26 Mar 2024

In this paper, we propose a solution for improving the quality of captions generated for figures in papers.

Visual Hallucination: Definition, Quantification, and Prescriptive Remediations

no code yet • 26 Mar 2024

The troubling rise of hallucination presents perhaps the most significant impediment to the advancement of responsible AI.

Semi-Supervised Image Captioning Considering Wasserstein Graph Matching

no code yet • 26 Mar 2024

Image captioning automatically generates captions for given images; the key challenge is learning a mapping from visual features to natural language features.

Automated Report Generation for Lung Cytological Images Using a CNN Vision Classifier and Multiple-Transformer Text Decoders: Preliminary Study

no code yet • 26 Mar 2024

Independent text decoders are prepared for benign and malignant cells, and the model switches between them according to the CNN classification result.
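The decoder-switching mechanism is simple to illustrate: the classifier's prediction indexes into a set of class-specific decoders. The following is a sketch under assumed interfaces, not the paper's code:

```python
# Hypothetical sketch: a CNN classifier's prediction selects which text decoder
# generates the report. Interfaces and names here are assumptions.
import torch
import torch.nn as nn

class SwitchedReportGenerator(nn.Module):
    def __init__(self, classifier: nn.Module, decoders: dict):
        super().__init__()
        self.classifier = classifier             # e.g. a benign-vs-malignant CNN
        self.decoders = nn.ModuleDict(decoders)  # one text decoder per class label
        self.labels = list(decoders.keys())

    @torch.no_grad()
    def generate(self, image_feats: torch.Tensor):
        # Assumes a single image; each decoder maps features to report text.
        pred = self.classifier(image_feats).argmax(dim=-1).item()
        return self.decoders[self.labels[pred]](image_feats)
```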

Image Captioning in news report scenario

no code yet • 24 Mar 2024

Image captioning strives to generate pertinent captions for specified images, situating itself at the crossroads of Computer Vision (CV) and Natural Language Processing (NLP).

Cognitive resilience: Unraveling the proficiency of image-captioning models to interpret masked visual content

no code yet • 23 Mar 2024

This study explores the ability of Image Captioning (IC) models to decode masked visual content sourced from diverse datasets.
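As a rough sketch of the masking setup this evaluation implies (the patch size, masking ratio, and zero-filling below are assumptions, not the study's exact protocol), one can occlude random patches of an image tensor before captioning it:

```python
# Minimal masking sketch: zero out random patches of an image tensor before
# passing it to a captioning model. All parameters are illustrative assumptions.
import torch

def mask_patches(image: torch.Tensor, patch: int = 32, drop_ratio: float = 0.5) -> torch.Tensor:
    # image: (C, H, W). Zero out a random subset of non-overlapping patches.
    _, H, W = image.shape
    masked = image.clone()
    for y in range(0, H, patch):
        for x in range(0, W, patch):
            if torch.rand(()).item() < drop_ratio:
                masked[:, y:y + patch, x:x + patch] = 0.0
    return masked

masked = mask_patches(torch.rand(3, 224, 224))  # then caption `masked` and compare
```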

MyVLM: Personalizing VLMs for User-Specific Queries

no code yet • 21 Mar 2024

To effectively recognize a variety of user-specific concepts, we augment the VLM with external concept heads that function as toggles for the model, enabling the VLM to identify the presence of specific target concepts in a given image.
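The concept-head idea can be sketched generically as a lightweight probe over frozen image embeddings that fires when the user-specific concept is present. This is an illustrative reconstruction of the mechanism, not MyVLM's actual code; dimensions and names are assumptions:

```python
# Illustrative "concept head": a small classifier over a frozen image embedding
# that signals whether a user-specific concept appears (not MyVLM's real code).
import torch
import torch.nn as nn

class ConceptHead(nn.Module):
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.probe = nn.Linear(embed_dim, 1)  # trained on a handful of concept images

    def forward(self, image_embedding: torch.Tensor) -> torch.Tensor:
        # Probability that the target concept is present in the image.
        return torch.sigmoid(self.probe(image_embedding))

head = ConceptHead()
score = head(torch.randn(1, 768))
# When the head fires, a learned concept embedding can be injected into the VLM's
# context so that generated captions refer to the personalized concept.
```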

Improved Baselines for Data-efficient Perceptual Augmentation of LLMs

no code yet • 20 Mar 2024

The abilities of large language models (LLMs) have recently progressed to unprecedented levels, paving the way to novel applications in a wide variety of areas.

As Firm As Their Foundations: Can open-sourced foundation models be used to create adversarial examples for downstream tasks?

no code yet • 19 Mar 2024

Foundation models pre-trained on web-scale vision-language data, such as CLIP, are widely used as cornerstones of powerful machine learning systems.

Boosting Transferability in Vision-Language Attacks via Diversification along the Intersection Region of Adversarial Trajectory

no code yet • 19 Mar 2024

Vision-language pre-training (VLP) models exhibit remarkable capabilities in comprehending both images and text, yet they remain susceptible to multimodal adversarial examples (AEs).