Image Captioning

598 papers with code • 31 benchmarks • 64 datasets

Image Captioning is the task of describing the content of an image in words. This task lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework, where an input image is encoded into an intermediate representation of the information in the image and then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with BLEU or CIDEr metrics.

(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)
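
As a minimal illustration of the encoder-decoder pipeline described above, the sketch below encodes an image with a ViT backbone and decodes a caption with GPT-2. It assumes the Hugging Face transformers library and the community checkpoint nlpconnect/vit-gpt2-image-captioning, neither of which is prescribed by this page; any vision encoder paired with a text decoder follows the same pattern.

```python
# Minimal encoder-decoder captioning sketch (assumes transformers, torch and PIL are
# installed; the checkpoint name is an illustrative choice, not a recommendation).
from PIL import Image
from transformers import AutoTokenizer, ViTImageProcessor, VisionEncoderDecoderModel

model_name = "nlpconnect/vit-gpt2-image-captioning"
model = VisionEncoderDecoderModel.from_pretrained(model_name)
processor = ViTImageProcessor.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

image = Image.open("example.jpg").convert("RGB")  # any input image
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# The ViT encoder maps the image to an intermediate representation;
# the GPT-2 decoder generates the caption token by token (beam search here).
output_ids = model.generate(pixel_values, max_length=32, num_beams=4)
caption = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(caption)
```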

Libraries

Use these libraries to find Image Captioning models and implementations
See all 8 libraries.

Can Language Beat Numerical Regression? Language-Based Multimodal Trajectory Prediction

inhwanbae/lmtrajectory 27 Mar 2024

Next, to guide the language model in understanding and reasoning about high-level knowledge, such as scene context and social relationships between pedestrians, we introduce auxiliary multi-task question answering.

ECoDepth: Effective Conditioning of Diffusion Models for Monocular Depth Estimation

aradhye2002/ecodepth 27 Mar 2024

We argue that the embedding vector from a ViT model, pre-trained on a large dataset, captures more relevant information for single-image depth estimation (SIDE) than the usual route of generating pseudo image captions followed by CLIP-based text embeddings.
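
As a loose sketch of the two conditioning routes being compared, the snippet below extracts a global image embedding directly from a pre-trained ViT and, separately, a CLIP text embedding of a pseudo-caption. The checkpoint names and the placeholder caption are illustrative assumptions, not the paper's exact setup.

```python
# Sketch contrasting the two conditioning signals discussed above; checkpoint names and
# the pseudo-caption are illustrative assumptions.
import torch
from PIL import Image
from transformers import ViTModel, ViTImageProcessor, CLIPTextModel, CLIPTokenizer

image = Image.open("example.jpg").convert("RGB")

# Route 1: embedding vector taken directly from a pre-trained ViT.
vit_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
with torch.no_grad():
    vit_inputs = vit_processor(images=image, return_tensors="pt")
    vit_embedding = vit(**vit_inputs).last_hidden_state[:, 0]  # [CLS] token, shape (1, 768)

# Route 2: generate a pseudo-caption (elided here), then embed it with CLIP's text encoder.
pseudo_caption = "a photo of a living room with a sofa"  # placeholder caption
clip_tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
clip_text = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
with torch.no_grad():
    text_inputs = clip_tokenizer([pseudo_caption], return_tensors="pt", padding=True)
    clip_text_embedding = clip_text(**text_inputs).pooler_output  # shape (1, 512)

# Either vector could condition a diffusion-based depth model; the paper argues the
# direct ViT route retains more task-relevant information than the caption-to-CLIP route.
print(vit_embedding.shape, clip_text_embedding.shape)
```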

VL-ICL Bench: The Devil in the Details of Benchmarking Multimodal In-Context Learning

ys-zong/vl-icl 19 Mar 2024

Built on top of LLMs, vision large language models (VLLMs) have advanced significantly in areas such as recognition, reasoning, and grounding.

Are Vision Language Models Texture or Shape Biased and Can We Steer Them?

paulgavrikov/vlm_shapebias 14 Mar 2024

If text does indeed influence visual biases, this suggests that we may be able to steer visual biases not just through visual input but also through language: a hypothesis that we confirm through extensive experiments.

Beyond Text: Frozen Large Language Models in Visual Signal Comprehension

zh460045050/v2l-tokenizer 12 Mar 2024

To achieve this, we present the Vision-to-Language Tokenizer, abbreviated as V2T Tokenizer, which transforms an image into a "foreign language" with the combined aid of an encoder-decoder, the LLM vocabulary, and a CLIP model.
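
The general idea of mapping an image into an LLM's own vocabulary can be sketched as nearest-neighbour quantization of image features against the LLM's token embedding table. The snippet below is a simplified illustration under assumed inputs, not the paper's exact pipeline, which additionally relies on an encoder-decoder and CLIP to align the two spaces.

```python
# Simplified illustration: quantize image features to their nearest neighbours in the
# LLM's token embedding table, turning the image into a sequence of vocabulary tokens.
import torch
import torch.nn.functional as F

vocab_size, dim, num_patches = 32000, 768, 16

# Assumed inputs: the frozen LLM's token embedding table and image patch features that
# have already been projected into the same space (the part the tokenizer must learn).
llm_token_embeddings = torch.randn(vocab_size, dim)    # stand-in for the real table
projected_patch_feats = torch.randn(num_patches, dim)  # stand-in for aligned image features

# Nearest-neighbour lookup by cosine similarity: each patch becomes one vocabulary token.
sims = F.normalize(projected_patch_feats, dim=-1) @ F.normalize(llm_token_embeddings, dim=-1).T
token_ids = sims.argmax(dim=-1)  # shape (num_patches,)

# These token ids form the "foreign language" sentence that a frozen LLM can consume.
print(token_ids.tolist())
```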

MeaCap: Memory-Augmented Zero-shot Image Captioning

joeyz0z/meacap 6 Mar 2024

The MeaCap framework achieves state-of-the-art performance across a range of zero-shot image captioning (IC) settings.

VTG-GPT: Tuning-Free Zero-Shot Video Temporal Grounding with GPT

YoucanBaby/VTG-GPT • Applied Sciences 2024 • 4 Mar 2024

Video temporal grounding (VTG) aims to locate specific temporal segments from an untrimmed video based on a linguistic query.

What Is Missing in Multilingual Visual Reasoning and How to Fix It

yueqis/multilingual_visual_reasoning 3 Mar 2024

NLP models today strive to support multiple languages and modalities, improving accessibility for diverse users.

Improving Explicit Spatial Relationships in Text-to-Image Generation through an Automatically Derived Dataset

salanueva/sr4g 1 Mar 2024

We hypothesize that this is because explicit spatial relations rarely appear in the image captions used to train these models.

Polos: Multimodal Metric Learning from Human Feedback for Image Captioning

keio-smilab24/Polos 28 Feb 2024

Establishing an automatic evaluation metric that closely aligns with human judgments is essential for effectively developing image captioning models.
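
For context, conventional automatic caption metrics such as the BLEU score mentioned in the task description are computed from n-gram overlap with reference captions. The sketch below uses NLTK (an assumed dependency) to score a single candidate caption; learned metrics such as Polos are instead trained from human feedback to correlate better with human judgments than such overlap scores.

```python
# Minimal n-gram metric sketch with NLTK (assumed dependency, not required by Polos).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a man rides a bicycle down a city street".split(),
    "a cyclist is riding through the street".split(),
]
candidate = "a man is riding a bike down the street".split()

# BLEU-4 with smoothing so short captions do not collapse to zero.
score = sentence_bleu(references, candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {score:.3f}")
```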
