Image Captioning

614 papers with code • 32 benchmarks • 65 datasets

Image Captioning is the task of describing the content of an image in words. It lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework, in which an input image is encoded into an intermediate representation of its visual content and then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with metrics such as BLEU or CIDEr.
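As a minimal sketch of that encoder-decoder pattern, the PyTorch model below pairs a pretrained CNN encoder with an LSTM decoder. All module names and hyperparameters here are illustrative, not taken from any particular paper.

```python
# Minimal encoder-decoder captioning sketch (PyTorch).
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Encoder: a pretrained CNN with its classifier head replaced
        # by a projection into the decoder's embedding space.
        backbone = models.resnet50(weights="IMAGENET1K_V2")
        backbone.fc = nn.Linear(backbone.fc.in_features, embed_dim)
        self.encoder = backbone
        # Decoder: an LSTM that consumes the image feature as its
        # first "token", then emits one word per step.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images).unsqueeze(1)   # (B, 1, E)
        words = self.embed(captions)                # (B, T, E)
        inputs = torch.cat([feats, words], dim=1)   # image feature first
        hidden, _ = self.lstm(inputs)
        return self.head(hidden)                    # (B, T+1, V) logits
```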

(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)

Libraries

Use these libraries to find Image Captioning models and implementations
See all 8 libraries.

Most implemented papers

BERTScore: Evaluating Text Generation with BERT

Tiiiger/bert_score ICLR 2020

We propose BERTScore, an automatic evaluation metric for text generation.
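As a usage sketch, the authors' pip package (bert-score) exposes a one-call API; the candidate and reference captions below are made up for illustration.

```python
# Scoring generated captions against references with the
# bert-score package (pip install bert-score).
from bert_score import score

candidates = ["a dog runs across a grassy field"]
references = ["a brown dog is running through the grass"]

# P, R, F1 are tensors with one entry per candidate/reference pair.
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.4f}")
```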

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

salesforce/lavis 30 Jan 2023

The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models.
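A minimal zero-shot captioning sketch, here using the Hugging Face port of the BLIP-2 checkpoints rather than the salesforce/lavis repo itself; the image path is a placeholder and a CUDA device is assumed.

```python
# Zero-shot captioning with BLIP-2 via Hugging Face transformers.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

image = Image.open("example.jpg").convert("RGB")  # placeholder path
inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```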

MS-Celeb-1M: A Dataset and Benchmark for Large-Scale Face Recognition

deepinsight/insightface 27 Jul 2016

In this paper, we design a benchmark task and provide the associated datasets for recognizing face images and link them to corresponding entity keys in a knowledge base.

SPICE: Semantic Propositional Image Caption Evaluation

peteanderson80/SPICE 29 Jul 2016

There is considerable interest in the task of automatically generating image captions.

RISE: Randomized Input Sampling for Explanation of Black-box Models

eclique/RISE 19 Jun 2018

We compare our approach to state-of-the-art importance extraction methods using both an automatic deletion/insertion metric and a pointing metric based on human-annotated object segments.
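A simplified sketch of the RISE idea, assuming a `model_fn` callable that returns the target-class score for an HxWxC image: probe the black box with randomly masked copies of the input and average the masks, weighted by the scores. The reference implementation in eclique/RISE additionally uses smooth, randomly shifted upsampled masks rather than the blocky ones here.

```python
import numpy as np

def rise_saliency(image, model_fn, n_masks=1000, grid=7, p_keep=0.5):
    H, W = image.shape[:2]
    saliency = np.zeros((H, W), dtype=np.float64)
    for _ in range(n_masks):
        # Low-resolution binary mask, upsampled to image size.
        coarse = (np.random.rand(grid, grid) < p_keep).astype(np.float64)
        mask = np.kron(coarse, np.ones((H // grid + 1, W // grid + 1)))[:H, :W]
        score = model_fn(image * mask[..., None])  # target-class score
        saliency += score * mask
    # Normalize by the expected number of times each pixel is kept.
    return saliency / (n_masks * p_keep)
```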

COCO-Stuff: Thing and Stuff Classes in Context

nightrome/cocostuff CVPR 2018

To understand stuff and things in context we introduce COCO-Stuff, which augments all 164K images of the COCO 2017 dataset with pixel-wise annotations for 91 stuff classes.

Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks

theamrzaki/text_summurization_abstractive_methods NeurIPS 2015

Recurrent Neural Networks can be trained to produce sequences of tokens given some input, as exemplified by recent results in machine translation and image captioning.
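A sketch of how scheduled sampling can be wired into a decoder's training loop: at each step, feed the gold token with probability `teacher_prob`, otherwise the model's own previous prediction. `decoder_step` and the tensor shapes are assumptions for illustration; `teacher_prob` is typically annealed from 1 toward a floor as training progresses.

```python
import torch

def decode_with_scheduled_sampling(decoder_step, captions, state, teacher_prob):
    # captions: (B, T) gold token ids, starting with <start>.
    logits_per_step = []
    token = captions[:, 0]
    for t in range(1, captions.size(1)):
        logits, state = decoder_step(token, state)   # (B, V), new state
        logits_per_step.append(logits)
        # Coin flip per example: gold token vs. the model's own guess.
        use_gold = torch.rand(token.size(0), device=token.device) < teacher_prob
        predicted = logits.argmax(dim=-1)
        token = torch.where(use_gold, captions[:, t], predicted)
    return torch.stack(logits_per_step, dim=1)       # (B, T-1, V)
```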

A neural attention model for speech command recognition

douglas125/SpeechCmdRecognition 27 Aug 2018

This paper introduces a convolutional recurrent network with attention for speech command recognition.

Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning

jiasenlu/AdaptiveAttention CVPR 2017

The model decides whether to attend to the image and where, in order to extract meaningful information for sequential word generation.
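A simplified sketch of the visual-sentinel gate behind that decision: attention is computed over the spatial features plus one extra "sentinel" vector, and the weight the sentinel receives (beta) controls how much the decoder falls back on its language state instead of the image. The paper's separate projections are folded into one scoring layer for brevity, and in the full model the sentinel itself is derived from the LSTM memory cell.

```python
import torch
import torch.nn as nn

class SentinelAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, spatial_feats, sentinel, query):
        # spatial_feats: (B, K, D), sentinel: (B, D), query: (B, D)
        cands = torch.cat([spatial_feats, sentinel.unsqueeze(1)], dim=1)
        logits = self.score(torch.tanh(cands + query.unsqueeze(1))).squeeze(-1)
        alpha = logits.softmax(dim=-1)                  # (B, K+1)
        context = (alpha.unsqueeze(-1) * cands).sum(dim=1)
        beta = alpha[:, -1]                             # weight on the sentinel
        return context, beta
```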

VinVL: Revisiting Visual Representations in Vision-Language Models

pzzhang/VinVL CVPR 2021

In our experiments we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model, OSCAR (Li et al., 2020), and utilize an improved approach, OSCAR+, to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks.