Dialogue Evaluation
48 papers with code • 2 benchmarks • 6 datasets
Most implemented papers
Evaluating Coherence in Dialogue Systems using Entailment
Evaluating open-domain dialogue systems is difficult due to the diversity of possible correct answers.
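The core idea can be sketched with an off-the-shelf NLI model: treat the dialogue history as the premise and the candidate response as the hypothesis, and use the entailment probability as a coherence signal. A minimal sketch, assuming the Hugging Face transformers library with roberta-large-mnli as a stand-in for the paper's own entailment model:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Stand-in NLI checkpoint; the paper trains its own entailment model,
# which this off-the-shelf model only approximates.
tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def coherence_score(context: str, response: str) -> float:
    """Entailment probability of the response given the dialogue context."""
    inputs = tok(context, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = logits.softmax(dim=-1).squeeze()
    # roberta-large-mnli label order: 0 contradiction, 1 neutral, 2 entailment.
    return probs[2].item()

print(coherence_score("I just got back from a hiking trip in the Alps.",
                      "Nice! Which trails did you do?"))
```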
Towards Best Experiment Design for Evaluating Dialogue System Output
To overcome the limitations of automated metrics (e.g., BLEU, METEOR) for evaluating dialogue systems, researchers typically use human judgments to provide convergent evidence.
PONE: A Novel Automatic Evaluation Metric for Open-Domain Generative Dialogue Systems
Through extensive experiments, learning-based metrics are demonstrated to be the most effective evaluation metrics for open-domain generative dialogue systems.
USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation
The lack of meaningful automatic evaluation metrics for dialog has impeded open-domain dialog research.
Learning an Unreferenced Metric for Online Dialogue Evaluation
Evaluating the quality of a dialogue interaction between two agents is a difficult task, especially in open-domain chit-chat style dialogue.
Towards Holistic and Automatic Evaluation of Open-Domain Dialogue Generation
Open-domain dialogue generation has gained increasing attention in Natural Language Processing.
Improving Dialog Evaluation with a Multi-reference Adversarial Dataset and Large Scale Pretraining
However, no such data is publicly available, and hence existing models are usually trained using a single relevant response and multiple randomly selected responses from other contexts (random negatives).
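The random-negative construction described above is straightforward to sketch; the data layout and names below are illustrative, assuming dialogues given as (context, gold_response) pairs:

```python
import random

def build_training_pairs(dialogues, num_negatives=5, seed=0):
    """Pair each context with its gold response (label 1) and with
    responses sampled from other contexts as random negatives (label 0).
    Assumes more than one dialogue in the list."""
    rng = random.Random(seed)
    responses = [resp for _, resp in dialogues]
    pairs = []
    for i, (context, gold) in enumerate(dialogues):
        pairs.append((context, gold, 1))
        for _ in range(num_negatives):
            j = rng.randrange(len(dialogues))
            while j == i:  # skip the context's own gold response
                j = rng.randrange(len(dialogues))
            pairs.append((context, responses[j], 0))
    return pairs
```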
GRADE: Automatic Graph-Enhanced Coherence Metric for Evaluating Open-Domain Dialogue Systems
Capitalizing on the topic-level dialogue graph, we propose a new evaluation metric, GRADE, which stands for Graph-enhanced Representations for Automatic Dialogue Evaluation.
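As a toy illustration of the topic-graph intuition only (not the actual GRADE model, which learns graph-enhanced representations): link topics in the context to topics in the response with similarity-weighted edges and score coherence by the strength of the best alignments.

```python
def topic_graph_coherence(context_topics, response_topics, similarity):
    """Toy sketch: for each response topic, take its strongest
    similarity-weighted edge into the context topics, then average."""
    if not context_topics or not response_topics:
        return 0.0
    best = [max(similarity(c, r) for c in context_topics)
            for r in response_topics]
    return sum(best) / len(best)

# `similarity` would normally be embedding cosine similarity;
# a trivial character-overlap stand-in keeps the sketch self-contained.
overlap = lambda a, b: len(set(a) & set(b)) / len(set(a) | set(b))
print(topic_graph_coherence(["travel", "mountains"], ["hiking"], overlap))
```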
Deconstruct to Reconstruct a Configurable Evaluation Metric for Open-Domain Dialogue Systems
For instance, specificity is mandatory in a food-ordering dialogue task, whereas fluency is preferred in a language-teaching dialogue system.
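One natural reading of "configurable" is a weighted combination of per-quality sub-metrics, with the weights set per task. A minimal sketch, where the sub-metric functions and weights are hypothetical:

```python
def configurable_score(context, response, submetrics, weights):
    """Combine per-quality sub-metrics (e.g. fluency, specificity)
    with task-specific weights. `submetrics` maps quality name to a
    scoring function; `weights` maps quality name to its weight."""
    total = sum(weights.values())
    return sum(weights[name] * fn(context, response)
               for name, fn in submetrics.items()) / total

# A food-ordering task might weight specificity heavily, while a
# language-teaching system weights fluency (weights are illustrative):
# score = configurable_score(ctx, resp,
#     {"fluency": fluency_fn, "specificity": specificity_fn},
#     {"fluency": 0.2, "specificity": 0.8})
```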
An Adversarially-Learned Turing Test for Dialog Generation Models
To alleviate this risk, we propose an adversarial training approach to learn a robust model, ATT (Adversarial Turing Test), that discriminates machine-generated responses from human-written replies.
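The discriminator half of such a setup can be sketched in PyTorch; the architecture, embeddings, and training step below are simplified stand-ins, not the paper's ATT model:

```python
import torch
import torch.nn as nn

class ResponseDiscriminator(nn.Module):
    """Toy discriminator: scores whether a (context, response)
    embedding pair looks human-written (1) or machine-generated (0)."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, context_emb, response_emb):
        return self.net(torch.cat([context_emb, response_emb], dim=-1))

# One discriminator update (sketch): separate human replies from
# generator samples; the generator would then be updated to fool it.
disc = ResponseDiscriminator()
opt = torch.optim.Adam(disc.parameters(), lr=1e-4)
loss_fn = nn.BCEWithLogitsLoss()

ctx = torch.randn(8, 256)    # stand-in context embeddings
human = torch.randn(8, 256)  # stand-in human-reply embeddings
fake = torch.randn(8, 256)   # stand-in generator samples

logits = torch.cat([disc(ctx, human), disc(ctx, fake)])
labels = torch.cat([torch.ones(8, 1), torch.zeros(8, 1)])
loss = loss_fn(logits, labels)
opt.zero_grad(); loss.backward(); opt.step()
```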