Image Captioning
619 papers with code • 33 benchmarks • 65 datasets
Image Captioning is the task of describing the content of an image in words. This task lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework, where an input image is encoded into an intermediate representation of the information in the image, and then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated according to a BLEU or CIDER metric.
( Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)
Libraries
Use these libraries to find Image Captioning models and implementationsDatasets
Subtasks
Most implemented papers
Delete, Retrieve, Generate: A Simple Approach to Sentiment and Style Transfer
We consider the task of text attribute transfer: transforming a sentence to alter a specific attribute (e. g., sentiment) while preserving its attribute-independent content (e. g., changing "screen is just the right size" to "screen is too small").
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision.
Analog Bits: Generating Discrete Data using Diffusion Models with Self-Conditioning
The main idea behind our approach is to first represent the discrete data as binary bits, and then train a continuous diffusion model to model these bits as real numbers which we call analog bits.
Attention on Attention for Image Captioning
In this paper, we propose an Attention on Attention (AoA) module, which extends the conventional attention mechanisms to determine the relevance between attention results and queries.
CoCa: Contrastive Captioners are Image-Text Foundation Models
We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the multimodal decoder outputs which predicts text tokens autoregressively.
Learning a Deep Embedding Model for Zero-Shot Learning
In this paper we argue that the key to make deep ZSL models succeed is to choose the right embedding space.
Evolving Deep Neural Networks
The success of deep learning depends on finding an architecture to fit the task.
Bayesian Recurrent Neural Networks
We also empirically demonstrate how Bayesian RNNs are superior to traditional RNNs on a language modelling benchmark and an image captioning task, as well as showing how each of these methods improve our model over a variety of other schemes for training them.
What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator?
This view suggests that the RNN should only be used to encode linguistic features and that only the final representation should be `merged' with the image features at a later stage.
Convolutional Image Captioning
In recent years significant progress has been made in image captioning, using Recurrent Neural Networks powered by long-short-term-memory (LSTM) units.