Image Captioning
614 papers with code • 32 benchmarks • 65 datasets
Image Captioning is the task of describing the content of an image in words. The task lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework, where an input image is encoded into an intermediate representation of the information in the image and then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with BLEU or CIDEr metrics.
(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)
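To make the encoder-decoder framework concrete, here is a minimal PyTorch sketch (not the model from any particular paper below): a CNN encoder compresses the image into a feature vector, which is fed to an LSTM decoder as the first "token" so that subsequent steps predict the caption word by word. All layer names and sizes are illustrative.

```python
import torch
import torch.nn as nn
from torchvision import models

class CaptionModel(nn.Module):
    """Minimal encoder-decoder captioner: CNN encoder -> LSTM decoder."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        cnn = models.resnet18(weights=None)
        # drop the classification head; keep conv trunk + global average pool
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])
        self.img_proj = nn.Linear(512, embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images).flatten(1)       # (B, 512) image features
        img_emb = self.img_proj(feats).unsqueeze(1)   # image as the first "token"
        tok_emb = self.embed(captions)                # (B, T, E) caption tokens
        seq = torch.cat([img_emb, tok_emb], dim=1)
        hidden, _ = self.lstm(seq)
        return self.out(hidden)                       # next-token logits per step
```

The forward pass above shows teacher-forced training; at inference time one would instead feed each predicted token back in, decoding greedily or with beam search.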
Libraries
Use these libraries to find Image Captioning models and implementations.
Datasets
Subtasks
Most implemented papers
BERTScore: Evaluating Text Generation with BERT
We propose BERTScore, an automatic evaluation metric for text generation.
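BERTScore matches candidate and reference tokens by contextual-embedding cosine similarity rather than exact n-gram overlap. A minimal sketch using the authors' bert-score package; the example strings are made up:

```python
# pip install bert-score
from bert_score import score

candidates = ["a dog runs across the grass"]
references = ["a brown dog is running on a lawn"]

# precision/recall/F1 from greedy cosine matching of BERT token embeddings
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.3f}")
```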
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models.
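BLIP-2 sidesteps this cost by training only a lightweight query transformer between a frozen image encoder and a frozen LLM. Public checkpoints can be run for zero-shot captioning through the Hugging Face transformers library; a minimal sketch, where the checkpoint name and image path are illustrative:

```python
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

ckpt = "Salesforce/blip2-opt-2.7b"  # one public checkpoint; others exist
processor = Blip2Processor.from_pretrained(ckpt)
model = Blip2ForConditionalGeneration.from_pretrained(ckpt)

image = Image.open("example.jpg")  # hypothetical local image
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True).strip())
```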
MS-Celeb-1M: A Dataset and Benchmark for Large-Scale Face Recognition
In this paper, we design a benchmark task and provide the associated datasets for recognizing face images and link them to corresponding entity keys in a knowledge base.
SPICE: Semantic Propositional Image Caption Evaluation
There is considerable interest in the task of automatically generating image captions.
RISE: Randomized Input Sampling for Explanation of Black-box Models
We compare our approach to state-of-the-art importance extraction methods using both an automatic deletion/insertion metric and a pointing metric based on human-annotated object segments.
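The core of RISE is simple enough to sketch: sample random coarse binary masks, score the masked images with the black-box model, and average the masks weighted by those scores. A minimal NumPy version, where score_fn is a hypothetical wrapper returning the model's probability for the class being explained (the paper additionally uses smooth bilinear upsampling of the masks with random shifts):

```python
import numpy as np

def rise_saliency(score_fn, image, n_masks=2000, grid=7, p=0.5, rng=None):
    """Estimate pixel importance by averaging random masks weighted by
    the black-box model's score on each masked image."""
    rng = rng or np.random.default_rng(0)
    H, W = image.shape[:2]
    cell = (H // grid + 1, W // grid + 1)
    saliency = np.zeros((H, W))
    for _ in range(n_masks):
        coarse = (rng.random((grid, grid)) < p).astype(float)  # coarse binary grid
        mask = np.kron(coarse, np.ones(cell))[:H, :W]          # upsample to image size
        saliency += score_fn(image * mask[..., None]) * mask   # weight mask by score
    return saliency / (n_masks * p)                            # normalize by E[mask]
```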
COCO-Stuff: Thing and Stuff Classes in Context
To understand stuff and things in context we introduce COCO-Stuff, which augments all 164K images of the COCO 2017 dataset with pixel-wise annotations for 91 stuff classes.
Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks
Recurrent Neural Networks can be trained to produce sequences of tokens given some input, as exemplified by recent results in machine translation and image captioning.
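A minimal PyTorch sketch of scheduled sampling for a toy caption decoder: at each step a per-example coin flip decides whether the next input is the gold token or the model's own previous prediction, with the sampling probability epsilon annealed upward over training. Module names and sizes are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn

class SampledDecoder(nn.Module):
    """Toy LSTM decoder trained with scheduled sampling."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTMCell(embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, targets, epsilon):
        B, T = targets.shape
        h = torch.zeros(B, self.lstm.hidden_size, device=targets.device)
        c = torch.zeros_like(h)
        inp, logits_seq = targets[:, 0], []
        for t in range(1, T):
            h, c = self.lstm(self.embed(inp), (h, c))
            logits = self.out(h)
            logits_seq.append(logits)
            # coin flip per example: feed the model's own prediction w.p. epsilon
            use_model = torch.rand(B, device=targets.device) < epsilon
            inp = torch.where(use_model, logits.argmax(-1), targets[:, t])
        return torch.stack(logits_seq, dim=1)  # (B, T-1, vocab) for the loss
```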
A neural attention model for speech command recognition
This paper introduces a convolutional recurrent network with attention for speech command recognition.
Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning
The model decides whether to attend to the image and where, in order to extract meaningful information for sequential word generation.
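The mechanism can be sketched as an attention layer that appends a "visual sentinel" to the image regions: the softmax weight that lands on the sentinel acts as a gate deciding how much to rely on language context versus attended visual features. A sketch with illustrative names, assuming region features and decoder state share one dimension d:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveAttention(nn.Module):
    """Visual-sentinel attention: mix sentinel s_t with visual context c_t."""
    def __init__(self, d):
        super().__init__()
        self.w_v = nn.Linear(d, d)
        self.w_h = nn.Linear(d, d)
        self.w_s = nn.Linear(d, d)
        self.score = nn.Linear(d, 1)

    def forward(self, V, h, s):
        # V: (B, K, d) region features; h, s: (B, d) decoder state / sentinel
        hq = self.w_h(h).unsqueeze(1)                                            # (B, 1, d)
        z_v = self.score(torch.tanh(self.w_v(V) + hq)).squeeze(-1)               # (B, K)
        z_s = self.score(torch.tanh(self.w_s(s).unsqueeze(1) + hq)).squeeze(-1)  # (B, 1)
        alpha = F.softmax(torch.cat([z_v, z_s], dim=1), dim=1)  # regions + sentinel
        beta = alpha[:, -1:]                                    # weight on the sentinel
        c = (alpha[:, :-1].unsqueeze(-1) * V).sum(1)            # attended visual context
        return beta * s + (1 - beta) * c                        # adaptive context
```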
VinVL: Revisiting Visual Representations in Vision-Language Models
In our experiments we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model, OSCAR (Li et al., 2020), and utilize an improved approach, OSCAR+, to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks.