Image Captioning
614 papers with code • 32 benchmarks • 65 datasets
Image Captioning is the task of describing the content of an image in words. The task lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework, where an input image is encoded into an intermediate representation of the information in the image and then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with BLEU or CIDEr metrics.
(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)
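To make the encoder-decoder framework concrete, here is a minimal PyTorch sketch (not the model from any particular paper below): a CNN encoder compresses the image into a feature vector, which is fed to an LSTM decoder as the first "token" so that subsequent steps predict the caption word by word. All layer names and sizes are illustrative.

```python
import torch
import torch.nn as nn
from torchvision import models

class CaptionModel(nn.Module):
    """Minimal encoder-decoder captioner: CNN encoder -> LSTM decoder."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        cnn = models.resnet18(weights=None)
        # drop the classification head; keep conv trunk + global average pool
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])
        self.img_proj = nn.Linear(512, embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images).flatten(1)       # (B, 512) image features
        img_emb = self.img_proj(feats).unsqueeze(1)   # image as the first "token"
        tok_emb = self.embed(captions)                # (B, T, E) caption tokens
        seq = torch.cat([img_emb, tok_emb], dim=1)
        hidden, _ = self.lstm(seq)
        return self.out(hidden)                       # next-token logits per step
```

The forward pass above shows teacher-forced training; at inference time one would instead feed each predicted token back in, decoding greedily or with beam search.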
Libraries
Use these libraries to find Image Captioning models and implementations.
Datasets
Subtasks
Most implemented papers
BERTScore: Evaluating Text Generation with BERT
We propose BERTScore, an automatic evaluation metric for text generation.
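BERTScore matches candidate and reference tokens by contextual-embedding cosine similarity rather than exact n-gram overlap. A minimal sketch using the authors' bert-score package; the example strings are made up:

```python
# pip install bert-score
from bert_score import score

candidates = ["a dog runs across the grass"]
references = ["a brown dog is running on a lawn"]

# precision/recall/F1 from greedy cosine matching of BERT token embeddings
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.3f}")
```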
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models.
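BLIP-2 sidesteps this cost by training only a lightweight query transformer between a frozen image encoder and a frozen LLM. Public checkpoints can be run for zero-shot captioning through the Hugging Face transformers library; a minimal sketch, where the checkpoint name and image path are illustrative:

```python
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

ckpt = "Salesforce/blip2-opt-2.7b"  # one public checkpoint; others exist
processor = Blip2Processor.from_pretrained(ckpt)
model = Blip2ForConditionalGeneration.from_pretrained(ckpt)

image = Image.open("example.jpg")  # hypothetical local image
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True).strip())
```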
MS-Celeb-1M: A Dataset and Benchmark for Large-Scale Face Recognition
In this paper, we design a benchmark task and provide the associated datasets for recognizing face images and link them to corresponding entity keys in a knowledge base.
SPICE: Semantic Propositional Image Caption Evaluation
There is considerable interest in the task of automatically generating image captions.
RISE: Randomized Input Sampling for Explanation of Black-box Models
We compare our approach to state-of-the-art importance extraction methods using both an automatic deletion/insertion metric and a pointing metric based on human-annotated object segments.
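The core of RISE is simple enough to sketch: sample random coarse binary masks, score the masked images with the black-box model, and average the masks weighted by those scores. A minimal NumPy version, where score_fn is a hypothetical wrapper returning the model's probability for the class being explained (the paper additionally uses smooth bilinear upsampling of the masks with random shifts):

```python
import numpy as np

def rise_saliency(score_fn, image, n_masks=2000, grid=7, p=0.5, rng=None):
    """Estimate pixel importance by averaging random masks weighted by
    the black-box model's score on each masked image."""
    rng = rng or np.random.default_rng(0)
    H, W = image.shape[:2]
    cell = (H // grid + 1, W // grid + 1)
    saliency = np.zeros((H, W))
    for _ in range(n_masks):
        coarse = (rng.random((grid, grid)) < p).astype(float)  # coarse binary grid
        mask = np.kron(coarse, np.ones(cell))[:H, :W]          # upsample to image size
        saliency += score_fn(image * mask[..., None]) * mask   # weight mask by score
    return saliency / (n_masks * p)                            # normalize by E[mask]
```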
COCO-Stuff: Thing and Stuff Classes in Context
To understand stuff and things in context we introduce COCO-Stuff, which augments all 164K images of the COCO 2017 dataset with pixel-wise annotations for 91 stuff classes.
Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks
Recurrent Neural Networks can be trained to produce sequences of tokens given some input, as exemplified by recent results in machine translation and image captioning.
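A minimal PyTorch sketch of scheduled sampling for a toy caption decoder: at each step a per-example coin flip decides whether the next input is the gold token or the model's own previous prediction, with the sampling probability epsilon annealed upward over training. Module names and sizes are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn

class SampledDecoder(nn.Module):
    """Toy LSTM decoder trained with scheduled sampling."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTMCell(embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, targets, epsilon):
        B, T = targets.shape
        h = torch.zeros(B, self.lstm.hidden_size, device=targets.device)
        c = torch.zeros_like(h)
        inp, logits_seq = targets[:, 0], []
        for t in range(1, T):
            h, c = self.lstm(self.embed(inp), (h, c))
            logits = self.out(h)
            logits_seq.append(logits)
            # coin flip per example: feed the model's own prediction w.p. epsilon
            use_model = torch.rand(B, device=targets.device) < epsilon
            inp = torch.where(use_model, logits.argmax(-1), targets[:, t])
        return torch.stack(logits_seq, dim=1)  # (B, T-1, vocab) for the loss
```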
A neural attention model for speech command recognition
This paper introduces a convolutional recurrent network with attention for speech command recognition.
Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning
The model decides whether to attend to the image and where, in order to extract meaningful information for sequential word generation.
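The mechanism can be sketched as an attention layer that appends a "visual sentinel" to the image regions: the softmax weight that lands on the sentinel acts as a gate deciding how much to rely on language context versus attended visual features. A sketch with illustrative names, assuming region features and decoder state share one dimension d:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveAttention(nn.Module):
    """Visual-sentinel attention: mix sentinel s_t with visual context c_t."""
    def __init__(self, d):
        super().__init__()
        self.w_v = nn.Linear(d, d)
        self.w_h = nn.Linear(d, d)
        self.w_s = nn.Linear(d, d)
        self.score = nn.Linear(d, 1)

    def forward(self, V, h, s):
        # V: (B, K, d) region features; h, s: (B, d) decoder state / sentinel
        hq = self.w_h(h).unsqueeze(1)                                            # (B, 1, d)
        z_v = self.score(torch.tanh(self.w_v(V) + hq)).squeeze(-1)               # (B, K)
        z_s = self.score(torch.tanh(self.w_s(s).unsqueeze(1) + hq)).squeeze(-1)  # (B, 1)
        alpha = F.softmax(torch.cat([z_v, z_s], dim=1), dim=1)  # regions + sentinel
        beta = alpha[:, -1:]                                    # weight on the sentinel
        c = (alpha[:, :-1].unsqueeze(-1) * V).sum(1)            # attended visual context
        return beta * s + (1 - beta) * c                        # adaptive context
```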
VinVL: Revisiting Visual Representations in Vision-Language Models
In our experiments we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model, OSCAR (Li et al., 2020), and utilize an improved approach, OSCAR+, to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks.