SelfEval: Leveraging the discriminative nature of generative models for evaluation

17 Nov 2023 · Sai Saketh Rambhatla, Ishan Misra

In this work, we show that text-to-image generative models can be 'inverted' to assess their own text-image understanding capabilities in a completely automated manner. Our method, called SelfEval, uses the generative model to compute the likelihood of real images given text prompts, making the generative model directly applicable to discriminative tasks. Using SelfEval, we repurpose standard datasets created for evaluating multimodal text-image discriminative models to evaluate generative models in a fine-grained manner: assessing their performance on attribute binding, color recognition, counting, shape recognition, and spatial understanding. To the best of our knowledge, SelfEval is the first automated metric to show a high degree of agreement with gold-standard human evaluations of text faithfulness across multiple models and benchmarks. Moreover, SelfEval enables us to evaluate generative models on challenging tasks such as the Winoground image score, where they perform competitively with discriminative models. We also show severe drawbacks of standard automated metrics such as CLIP-score for measuring text faithfulness on benchmarks such as DrawBench, and how SelfEval sidesteps these issues. We hope SelfEval enables easy and reliable automated evaluation for diffusion models.
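The core idea above, using a generative model's likelihood of an image under each candidate prompt as a discriminative score, can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a user-supplied `denoise_error` function standing in for a diffusion model's denoising loss, which is a Monte Carlo proxy for the negative ELBO, i.e. for -log p(image | caption).

```python
import numpy as np

def selfeval_classify(denoise_error, image, captions, n_samples=16, seed=0):
    """SelfEval-style discriminative scoring (sketch, not the authors' code).

    A diffusion model's denoising error at random timesteps estimates the
    negative evidence lower bound, a proxy for -log p(image | caption).
    We score each caption by its mean error over sampled timesteps and
    return the index of the caption with the lowest (best) score.
    """
    rng = np.random.default_rng(seed)
    mean_errors = []
    for caption in captions:
        # Average the denoising error over several random timesteps t in [0, 1).
        errs = [denoise_error(image, caption, t=rng.uniform(), rng=rng)
                for _ in range(n_samples)]
        mean_errors.append(np.mean(errs))
    # Lowest mean error = highest estimated likelihood p(image | caption).
    return int(np.argmin(mean_errors))
```

In practice `denoise_error` would run the text-conditioned diffusion model's forward pass; here it is an assumed interface so the scoring logic stays self-contained.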


Results from the Paper


| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Visual Reasoning | Winoground | LDM-T5 (SelfEval) | Text Score | 29.00 | #74 |
| Visual Reasoning | Winoground | LDM-T5 (SelfEval) | Image Score | 13.50 | #78 |
| Visual Reasoning | Winoground | PDM-T5 (SelfEval) | Text Score | 28.25 | #76 |
| Visual Reasoning | Winoground | PDM-T5 (SelfEval) | Image Score | 12.00 | #84 |
| Visual Reasoning | Winoground | PDM-CLIP (SelfEval) | Text Score | 17.00 | #106 |
| Visual Reasoning | Winoground | PDM-CLIP (SelfEval) | Image Score | 14.00 | #73 |
| Visual Reasoning | Winoground | LDM-CLIP (SelfEval) | Text Score | 22.75 | #92 |
| Visual Reasoning | Winoground | LDM-CLIP (SelfEval) | Image Score | 7.25 | #100 |
| Visual Reasoning | Winoground | OCLIP (ViT-H/14) | Text Score | 30.75 | #62 |
| Visual Reasoning | Winoground | OCLIP (ViT-H/14) | Image Score | 12.75 | #82 |
| Visual Reasoning | Winoground | CLIP (ViT-L/14) | Text Score | 30.25 | #67 |
| Visual Reasoning | Winoground | CLIP (ViT-L/14) | Image Score | 8.00 | #96 |
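The text and image scores reported above follow the standard Winoground protocol: each example pairs two captions with two images, and a model is credited only when its caption-image compatibility scores rank both correct pairings above the swapped ones. A minimal sketch of that scoring, assuming a user-supplied `score(caption, image)` function (e.g. a SelfEval likelihood, where higher means a better match):

```python
def winoground_scores(examples, score):
    """Compute Winoground text and image accuracy (sketch of the protocol).

    Each example is a tuple (c0, c1, i0, i1) where caption c0 matches
    image i0 and caption c1 matches image i1. `score(caption, image)`
    is any compatibility score; higher means a better match.
    """
    n_text = n_image = 0
    for c0, c1, i0, i1 in examples:
        # Text score: for each image, the matching caption must score higher.
        text_ok = score(c0, i0) > score(c1, i0) and score(c1, i1) > score(c0, i1)
        # Image score: for each caption, the matching image must score higher.
        image_ok = score(c0, i0) > score(c0, i1) and score(c1, i1) > score(c1, i0)
        n_text += text_ok
        n_image += image_ok
    n = len(examples)
    return n_text / n, n_image / n
```

The image score is the stricter direction in the table above: ranking two images against one caption is where both SelfEval and the discriminative CLIP baselines score well below their text scores.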
