Image Captioning
598 papers with code • 31 benchmarks • 64 datasets
Image Captioning is the task of describing the content of an image in words. This task lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework, where an input image is encoded into an intermediate representation of its content and then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with metrics such as BLEU and CIDEr.
(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)
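The encoder-decoder pattern described above can be sketched in a few lines of PyTorch. This is a minimal illustrative sketch, not the implementation of any paper listed on this page: it assumes a pretrained vision backbone has already produced a single feature vector per image, and all names and dimensions (`CaptionDecoder`, `feat_dim`, and so on) are hypothetical.

```python
# Minimal encoder-decoder captioning sketch (illustrative; names and
# dimensions are hypothetical, not taken from any paper on this page).
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, feat_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.init_h = nn.Linear(feat_dim, hidden_dim)  # image feature -> initial LSTM state
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feats, captions):
        # image_feats: (B, feat_dim) from a pretrained CNN/ViT encoder
        # captions:    (B, T) token ids, teacher-forced during training
        h0 = torch.tanh(self.init_h(image_feats)).unsqueeze(0)  # (1, B, H)
        c0 = torch.zeros_like(h0)
        emb = self.embed(captions)            # (B, T, E)
        hidden, _ = self.lstm(emb, (h0, c0))  # (B, T, H)
        return self.out(hidden)               # (B, T, vocab_size) logits

# Training would minimize cross-entropy between these logits and the
# next-token targets; at inference, tokens are generated one at a time
# (greedy, beam search, or sampling) from a start-of-sequence token.
```

Modern systems often replace the LSTM with a Transformer decoder attending over a grid of region or patch features, but the encode-then-decode structure is the same.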
Latest papers
Can Language Beat Numerical Regression? Language-Based Multimodal Trajectory Prediction
Next, to guide the language model in understanding and reasoning about high-level knowledge, such as scene context and social relationships between pedestrians, we introduce auxiliary multi-task question answering.
ECoDepth: Effective Conditioning of Diffusion Models for Monocular Depth Estimation
We argue that the embedding vector from a ViT model, pre-trained on a large dataset, captures more information relevant to SIDE than the usual route of generating pseudo image captions followed by CLIP-based text embeddings.
VL-ICL Bench: The Devil in the Details of Benchmarking Multimodal In-Context Learning
Built on top of LLMs, vision large language models (VLLMs) have advanced significantly in areas such as recognition, reasoning, and grounding.
Are Vision Language Models Texture or Shape Biased and Can We Steer Them?
If text does indeed influence visual biases, this suggests that we may be able to steer visual biases not just through visual input but also through language: a hypothesis that we confirm through extensive experiments.
Beyond Text: Frozen Large Language Models in Visual Signal Comprehension
To achieve this, we present the Vision-to-Language Tokenizer, abbreviated as V2T Tokenizer, which transforms an image into a "foreign language" with the combined aid of an encoder-decoder, the LLM vocabulary, and a CLIP model.
MeaCap: Memory-Augmented Zero-shot Image Captioning
The MeaCap framework achieves state-of-the-art performance across a range of zero-shot image captioning (IC) settings.
VTG-GPT: Tuning-Free Zero-Shot Video Temporal Grounding with GPT
Video temporal grounding (VTG) aims to locate specific temporal segments from an untrimmed video based on a linguistic query.
What Is Missing in Multilingual Visual Reasoning and How to Fix It
NLP models today strive to support multiple languages and modalities, improving accessibility for diverse users.
Improving Explicit Spatial Relationships in Text-to-Image Generation through an Automatically Derived Dataset
We hypothesize that this is because explicit spatial relations rarely appear in the image captions used to train these models.
Polos: Multimodal Metric Learning from Human Feedback for Image Captioning
Establishing an automatic evaluation metric that closely aligns with human judgments is essential for effectively developing image captioning models.
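Polos addresses the gap between automatic metrics and human judgments. For context, the simplest automatic metric named in the overview, BLEU, measures n-gram overlap against reference captions and can be computed with NLTK; the sketch below is illustrative, and the candidate and reference captions are made-up examples. (CIDEr, the other metric mentioned above, is typically computed with the pycocoevalcap toolkit.)

```python
# Scoring one candidate caption against references with BLEU-4 (NLTK).
# The captions are made-up examples for illustration only.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a dog runs across the grass".split(),
    "a brown dog is running on a lawn".split(),
]
candidate = "a dog is running on the grass".split()

# Smoothing matters here: short captions often have zero 4-gram overlap.
score = sentence_bleu(
    references,
    candidate,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU-4: {score:.3f}")
```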