Image Captioning

619 papers with code • 33 benchmarks • 65 datasets

Image Captioning is the task of describing the content of an image in words. It lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework, in which an input image is encoded into an intermediate representation of its content and then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with metrics such as BLEU or CIDEr.

(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)
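
To make the encoder-decoder framing concrete, here is a minimal PyTorch sketch (an illustration, not any specific paper's model): a toy CNN encoder pools the image into one feature vector, which is fed to an LSTM decoder as its first input token. All layer choices and sizes are placeholder assumptions.

```python
import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    """Minimal encoder-decoder captioner: toy CNN encoder -> LSTM decoder."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Encoder: stands in for any CNN/ViT backbone, pooled to one vector.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # The encoded image is fed to the decoder as a "zeroth token".
        img = self.encoder(images).unsqueeze(1)          # (B, 1, E)
        seq = torch.cat([img, self.embed(captions)], dim=1)
        out, _ = self.decoder(seq)
        return self.head(out)                            # next-token logits
```

For evaluation, BLEU is available in nltk.translate.bleu_score, and the reference CIDEr implementation ships with the COCO caption evaluation toolkit.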

Libraries

Use these libraries to find Image Captioning models and implementations

Most implemented papers

Delete, Retrieve, Generate: A Simple Approach to Sentiment and Style Transfer

lijuncen/Sentiment-and-Style-Transfer NAACL 2018

We consider the task of text attribute transfer: transforming a sentence to alter a specific attribute (e.g., sentiment) while preserving its attribute-independent content (e.g., changing "screen is just the right size" to "screen is too small").
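
As a sketch of the paper's "delete" step, the snippet below scores each word by how much more often it appears in one attribute's corpus than the other's and drops the high-salience words, keeping the attribute-independent content. The salience function, threshold, and toy corpora are illustrative assumptions; the actual method also uses n-grams and the retrieve/generate steps.

```python
from collections import Counter

def salience(word, pos_counts, neg_counts, smooth=1.0):
    """How strongly a word marks the positive attribute (smoothed count ratio)."""
    return (pos_counts[word] + smooth) / (neg_counts[word] + smooth)

def delete_attribute_words(sentence, pos_counts, neg_counts, threshold=1.4):
    """The 'delete' step: drop words whose salience exceeds the threshold."""
    kept = [w for w in sentence.split()
            if salience(w, pos_counts, neg_counts) < threshold]
    return " ".join(kept)

# Toy corpora standing in for sentiment-labelled training data.
pos_counts = Counter("the screen is just the right size".split())
neg_counts = Counter("the screen is too small".split())
print(delete_attribute_words("screen is just the right size",
                             pos_counts, neg_counts))  # -> "screen is"
```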

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

salesforce/lavis 28 Jan 2022

Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision.

Analog Bits: Generating Discrete Data using Diffusion Models with Self-Conditioning

google-research/pix2seq 8 Aug 2022

The main idea behind our approach is to first represent the discrete data as binary bits, and then train a continuous diffusion model to model these bits as real numbers which we call analog bits.
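
The encoding itself is simple to sketch (NumPy assumed; the paper's self-conditioning and the diffusion model are omitted): each integer becomes a fixed-length binary code shifted to {-1, +1}, which the continuous model treats as real numbers, and decoding is plain thresholding.

```python
import numpy as np

def to_analog_bits(tokens, num_bits):
    """Map integer tokens to {-1, +1} 'analog bits' (shifted binary codes)."""
    bits = ((tokens[..., None] >> np.arange(num_bits)) & 1).astype(np.float32)
    return bits * 2.0 - 1.0   # {0, 1} -> {-1, +1}

def from_analog_bits(analog):
    """Invert by thresholding the real-valued model output at zero."""
    bits = (analog > 0).astype(np.int64)
    return (bits << np.arange(bits.shape[-1])).sum(axis=-1)

tokens = np.array([3, 7, 42])
analog = to_analog_bits(tokens, num_bits=8)  # the diffusion model sees these
assert (from_analog_bits(analog) == tokens).all()
```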

Attention on Attention for Image Captioning

husthuaan/AoANet ICCV 2019

In this paper, we propose an Attention on Attention (AoA) module, which extends the conventional attention mechanisms to determine the relevance between attention results and queries.
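
A PyTorch sketch of the AoA idea (shapes and the surrounding multi-head attention are illustrative assumptions): the query and the attention result are concatenated, projected to an "information" vector, and gated elementwise by a sigmoid "attention gate" computed from the same pair.

```python
import torch
import torch.nn as nn

class AttentionOnAttention(nn.Module):
    """AoA: gate the attention result by its relevance to the query."""
    def __init__(self, dim):
        super().__init__()
        self.info = nn.Linear(2 * dim, dim)  # "information" vector
        self.gate = nn.Linear(2 * dim, dim)  # sigmoid "attention gate"

    def forward(self, query, attn_result):
        x = torch.cat([query, attn_result], dim=-1)
        return self.info(x) * torch.sigmoid(self.gate(x))

# Usage: wrap any attention output, e.g. attention over image regions.
dim = 512
aoa = AttentionOnAttention(dim)
mha = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
q = torch.randn(2, 1, dim)        # decoder query
feats = torch.randn(2, 36, dim)   # image region features
attn_out, _ = mha(q, feats, feats)
out = aoa(q, attn_out)            # (2, 1, dim)
```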

CoCa: Contrastive Captioners are Image-Text Foundation Models

mlfoundations/open_clip 4 May 2022

We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the multimodal decoder outputs which predicts text tokens autoregressively.
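
The two-objective training can be sketched as follows (PyTorch; the loss weight, temperature, and tensor shapes are illustrative assumptions): a symmetric in-batch contrastive loss on the unimodal embeddings, plus token-level cross-entropy on the decoder's autoregressive caption logits.

```python
import torch
import torch.nn.functional as F

def coca_style_loss(img_emb, txt_emb, caption_logits, caption_targets,
                    temperature=0.07, cap_weight=2.0):
    """Contrastive loss on unimodal embeddings + captioning loss (sketch)."""
    # Contrastive: match each image to its paired text within the batch.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    contrastive = (F.cross_entropy(logits, labels)
                   + F.cross_entropy(logits.t(), labels)) / 2
    # Captioning: next-token prediction on (B, T, V) decoder logits.
    captioning = F.cross_entropy(caption_logits.flatten(0, 1),
                                 caption_targets.flatten())
    return contrastive + cap_weight * captioning
```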

Learning a Deep Embedding Model for Zero-Shot Learning

lzrobots/DeepEmbeddingModel_ZSL CVPR 2017

In this paper we argue that the key to making deep ZSL models succeed is to choose the right embedding space.

Evolving Deep Neural Networks

sbcblab/Keras-CoDeepNEAT 1 Mar 2017

The success of deep learning depends on finding an architecture to fit the task.

Bayesian Recurrent Neural Networks

mirceamironenco/BayesianRecurrentNN 10 Apr 2017

We also empirically demonstrate that Bayesian RNNs are superior to traditional RNNs on a language modelling benchmark and an image captioning task, and show how each of these methods improves our model over a variety of other training schemes.

What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator?

mtanti/rnn-role WS 2017

This view suggests that the RNN should only be used to encode linguistic features and that only the final representation should be "merged" with the image features at a later stage.
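
A minimal sketch of this "merge" architecture (dimensions are illustrative assumptions): the LSTM sees only word embeddings, and the image feature is combined with the RNN's final state just before the output layer, in contrast to "inject" architectures where the image conditions the RNN itself.

```python
import torch
import torch.nn as nn

class MergeCaptioner(nn.Module):
    """'Merge' architecture: the RNN encodes words only; the image joins late."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, img_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.img_proj = nn.Linear(img_dim, hidden_dim)
        self.head = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, img_feat, prefix_tokens):
        _, (h, _) = self.rnn(self.embed(prefix_tokens))
        lang = h[-1]                   # linguistic summary of the caption prefix
        img = self.img_proj(img_feat)  # image enters only after the RNN
        return self.head(torch.cat([lang, img], dim=-1))  # next-word logits
```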

Convolutional Image Captioning

aditya12agd5/convcap CVPR 2018

In recent years significant progress has been made in image captioning, using Recurrent Neural Networks powered by long short-term memory (LSTM) units.