TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Image Captioning	COCO Captions	CapDec	BLEU-4	26.4	# 31
Image Captioning	COCO Captions	CapDec	METEOR	25.1	# 26
Image Captioning	COCO Captions	CapDec	CIDEr	91.8	# 32
Semi Supervised Learning for Image Captioning	Flickr30k	CapDec	CIDEr	39.1	# 1
Image Captioning	FlickrStyle10K	CapDec	BLEU-1 (Romantic)	29.4	# 1
Semi Supervised Learning for Image Captioning	FlickrStyle10K	CapDec	CIDEr	30.0	# 1
Image Captioning	MSCOCO	CapDec	BLEU-4	26.4	# 1

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/text-only-training-for-image-captioning-using/semi-supervised-learning-for-image-captioning-2)](https://paperswithcode.com/sota/semi-supervised-learning-for-image-captioning-2?p=text-only-training-for-image-captioning-using)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/text-only-training-for-image-captioning-using/image-captioning-on-flickrstyle10k)](https://paperswithcode.com/sota/image-captioning-on-flickrstyle10k?p=text-only-training-for-image-captioning-using)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/text-only-training-for-image-captioning-using/semi-supervised-learning-for-image-captioning-3)](https://paperswithcode.com/sota/semi-supervised-learning-for-image-captioning-3?p=text-only-training-for-image-captioning-using)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/text-only-training-for-image-captioning-using/image-captioning-on-mscoco-1)](https://paperswithcode.com/sota/image-captioning-on-mscoco-1?p=text-only-training-for-image-captioning-using)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/text-only-training-for-image-captioning-using/image-captioning-on-coco-captions)](https://paperswithcode.com/sota/image-captioning-on-coco-captions?p=text-only-training-for-image-captioning-using)`

Text-Only Training for Image Captioning using Noise-Injected CLIP

1 Nov 2022 · David Nukrai, Ron Mokady, Amir Globerson

We consider the task of image-captioning using only the CLIP model and additional text data at training time, and no additional captioned images. Our approach relies on the fact that CLIP is trained to make visual and textual embeddings similar. Therefore, we only need to learn how to translate CLIP textual embeddings back into text, and we can learn how to do this by learning a decoder for the frozen CLIP text encoder using only text. We argue that this intuition is "almost correct" because of a gap between the embedding spaces, and propose to rectify this via noise injection during training. We demonstrate the effectiveness of our approach by showing SOTA zero-shot image captioning across four benchmarks, including style transfer. Code, data, and models are available on GitHub.
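The noise-injection idea in the abstract can be illustrated with a short sketch. This is not the authors' implementation: the function names, the toy 4-dimensional vectors standing in for CLIP embeddings, the noise level of 0.1, and the choice to normalize before adding noise are all assumptions made for illustration. The sketch only shows the asymmetry the abstract describes: during training the decoder receives a text embedding perturbed with Gaussian noise, and at inference it receives a clean image embedding.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def inject_noise(text_embedding, noise_std, rng):
    """Normalize a text embedding, then add zero-mean Gaussian noise per dimension.

    noise_std is a hypothetical hyperparameter; its value here is illustrative.
    """
    unit = l2_normalize(text_embedding)
    return [x + rng.gauss(0.0, noise_std) for x in unit]


def decoder_input(embedding, training, noise_std=0.1, rng=None):
    """Build the vector that would be fed to the caption decoder.

    Training: a noisy text embedding (text-only supervision).
    Inference: a clean, normalized image embedding.
    """
    if training:
        if rng is None:
            rng = random.Random()
        return inject_noise(embedding, noise_std, rng)
    return l2_normalize(embedding)


# Toy stand-ins for CLIP outputs (assumed values, not real embeddings).
rng = random.Random(0)
text_emb = [0.3, -1.2, 0.5, 2.0]
image_emb = [0.4, -1.0, 0.6, 1.9]

train_in = decoder_input(text_emb, training=True, noise_std=0.1, rng=rng)
test_in = decoder_input(image_emb, training=False)
```

In a real pipeline the vectors would come from the frozen CLIP text and image encoders, and `train_in` would condition a language-model decoder trained to reconstruct the original caption. The noise is intended to make the decoder tolerant of the gap between text and image embeddings that the abstract mentions.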

Code

davidhuji/capdec official

zelaki/wsac

Tasks

Image Captioning

Language Modelling

Semi Supervised Learning for Image Captioning

Datasets

MS COCO

Flickr30k

COCO Captions

MSCOCO

FlickrStyle10K

Results from the Paper

Ranked #1 on Image Captioning on MSCOCO

Methods

Adam • Attention Dropout • BPE • CLIP • Cosine Annealing • Dense Connections • Discriminative Fine-Tuning • Dropout • GELU • GPT-2 • Layer Normalization • Linear Layer • Linear Warmup With Cosine Annealing • Multi-Head Attention • Residual Connection • Scaled Dot-Product Attention • Softmax • Weight Decay
