Image Captioning

618 papers with code • 32 benchmarks • 65 datasets

Image Captioning is the task of describing the content of an image in words. The task lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework: an input image is encoded into an intermediate representation of its content and then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with metrics such as BLEU and CIDEr.
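As a rough illustration of this encoder-decoder pattern, the sketch below pairs a CNN image encoder with an LSTM text decoder in PyTorch. It is a minimal, hypothetical example (model sizes, vocabulary size, and module names are placeholders), not the architecture of any specific paper listed on this page.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionEncoder(nn.Module):
    """Encode an image into a fixed-size feature vector with a CNN backbone."""
    def __init__(self, embed_dim):
        super().__init__()
        backbone = models.resnet50(weights=None)  # pretrained weights optional
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop classifier head
        self.proj = nn.Linear(backbone.fc.in_features, embed_dim)

    def forward(self, images):                    # images: (B, 3, H, W)
        feats = self.cnn(images).flatten(1)       # (B, 2048)
        return self.proj(feats)                   # (B, embed_dim)

class CaptionDecoder(nn.Module):
    """Decode the image embedding into a word sequence with an LSTM."""
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_emb, captions):       # captions: (B, T) token ids
        tokens = self.embed(captions)             # (B, T, embed_dim)
        # Prepend the image embedding as the first "token" of the sequence.
        inputs = torch.cat([image_emb.unsqueeze(1), tokens], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                   # (B, T+1, vocab_size) logits

# Toy forward pass with random data.
encoder = CaptionEncoder(embed_dim=256)
decoder = CaptionDecoder(vocab_size=10000, embed_dim=256, hidden_dim=512)
images = torch.randn(2, 3, 224, 224)
captions = torch.randint(0, 10000, (2, 12))
logits = decoder(encoder(images), captions)       # (2, 13, 10000)
```

In practice the decoder is trained with teacher forcing on paired image-caption data, and generated captions are scored against reference captions with metrics such as BLEU or CIDEr.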

(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)

Libraries

Use these libraries to find Image Captioning models and implementations
See all 8 libraries.

Latest papers with no code

LocCa: Visual Pretraining with Location-aware Captioners

no code yet • 28 Mar 2024

In this paper, we propose a simple visual pretraining method with location-aware captioners (LocCa).

Text Data-Centric Image Captioning with Interactive Prompts

no code yet • 28 Mar 2024

The mainstream solution is to project image embeddings into the text embedding space, exploiting the consistent image-text representations learned by the CLIP model.
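As a rough sketch of that idea (not the method of the paper above), the snippet below uses a CLIP model from the Hugging Face transformers library together with a hypothetical linear projection that maps image embeddings into the text embedding space; the checkpoint name and projection layer are illustrative assumptions.

```python
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; any CLIP variant could be substituted.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical learned mapping from CLIP image space to CLIP text space.
# In a text-data-centric setup, this would be trained so that projected image
# embeddings can drive a decoder that was conditioned only on text embeddings.
projection = nn.Linear(clip.config.projection_dim, clip.config.projection_dim)

def project_image(pil_image):
    inputs = processor(images=pil_image, return_tensors="pt")
    with torch.no_grad():
        image_emb = clip.get_image_features(**inputs)   # (1, projection_dim)
    return projection(image_emb)                        # embedding in "text space"
```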

A Review of Multi-Modal Large Language and Vision Models

no code yet • 28 Mar 2024

Large Language Models (LLMs) have recently emerged as a focal point of research and application, driven by their unprecedented ability to understand and generate text with human-like quality.

A Survey on Large Language Models from Concept to Implementation

no code yet • 27 Mar 2024

Recent advancements in Large Language Models (LLMs), particularly those built on Transformer architectures, have significantly broadened the scope of natural language processing (NLP) applications, transcending their initial use in chatbot technology.

The Solution for the ICCV 2023 1st Scientific Figure Captioning Challenge

no code yet • 26 Mar 2024

In this paper, we propose a solution for improving the quality of captions generated for figures in papers.

Visual Hallucination: Definition, Quantification, and Prescriptive Remediations

no code yet • 26 Mar 2024

The troubling rise of hallucination presents perhaps the most significant impediment to the advancement of responsible AI.

Semi-Supervised Image Captioning Considering Wasserstein Graph Matching

no code yet • 26 Mar 2024

Image captioning can automatically generate captions for the given images, and the key challenge is to learn a mapping function from visual features to natural language features.

Automated Report Generation for Lung Cytological Images Using a CNN Vision Classifier and Multiple-Transformer Text Decoders: Preliminary Study

no code yet • 26 Mar 2024

Independent text decoders for benign and malignant cells are prepared for text generation, and the text decoder switches according to the CNN classification results.

Image Captioning in news report scenario

no code yet • 24 Mar 2024

Image captioning strives to generate pertinent captions for specified images, situating itself at the crossroads of Computer Vision (CV) and Natural Language Processing (NLP).

Cognitive resilience: Unraveling the proficiency of image-captioning models to interpret masked visual content

no code yet • 23 Mar 2024

This study explores the ability of Image Captioning (IC) models to decode masked visual content sourced from diverse datasets.