Dialogue Evaluation

48 papers with code • 2 benchmarks • 6 datasets

Dialogue evaluation is the task of assessing the quality of responses produced by dialogue systems, either with automatic metrics or with human judgments.

Most implemented papers

Evaluating Coherence in Dialogue Systems using Entailment

nouhadziri/DialogEntailment NAACL 2019

Evaluating open-domain dialogue systems is difficult due to the diversity of possible correct answers.
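The paper casts response coherence as a textual entailment problem: the dialogue history serves as the premise and the candidate response as the hypothesis. A minimal sketch of that idea with an off-the-shelf NLI model follows; the model name and scoring details are assumptions for illustration, not the paper's own setup.

```python
# Sketch of entailment-based coherence scoring (not the paper's exact model):
# score how strongly the dialogue history entails the candidate response.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "roberta-large-mnli"  # assumption: any NLI model works for this sketch
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

def coherence_score(history_utterances, response):
    """Entailment probability of the response given the dialogue history."""
    premise = " ".join(history_utterances)
    inputs = tokenizer(premise, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1).squeeze(0)
    # Assumes the model config exposes an "ENTAILMENT" label, as MNLI models do.
    return probs[model.config.label2id["ENTAILMENT"]].item()

print(coherence_score(["Where are you from?", "I grew up in Toronto."],
                      "So you are Canadian."))
```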

Towards Best Experiment Design for Evaluating Dialogue System Output

sashank06/INLG_eval WS 2019

To overcome the limitations of automated metrics (e.g., BLEU, METEOR) for evaluating dialogue systems, researchers typically use human judgments to provide convergent evidence.
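A small, invented example of why word-overlap metrics fall short on dialogue: a perfectly acceptable response that shares few words with the single reference receives a near-zero BLEU score.

```python
# Illustration of the weakness of word-overlap metrics for dialogue.
# All sentences below are made up for illustration.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["i", "am", "doing", "great", "thanks", "for", "asking"]
valid_but_different = ["pretty", "good", "how", "about", "you"]
near_copy = ["i", "am", "doing", "great", "thanks"]

smooth = SmoothingFunction().method1
print(sentence_bleu([reference], valid_but_different, smoothing_function=smooth))
print(sentence_bleu([reference], near_copy, smoothing_function=smooth))
# The valid-but-different response scores near zero, which is why human
# judgments are used as convergent evidence.
```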

PONE: A Novel Automatic Evaluation Metric for Open-Domain Generative Dialogue Systems

gmftbyGMFTBY/PONE 6 Apr 2020

Through extensive experiments, learning-based metrics are demonstrated to be the most effective evaluation metrics for open-domain generative dialogue systems.

USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation

shikib/usr ACL 2020

The lack of meaningful automatic evaluation metrics for dialog has impeded open-domain dialog research.

Learning an Unreferenced Metric for Online Dialogue Evaluation

facebookresearch/online_dialog_eval ACL 2020

Evaluating the quality of a dialogue interaction between two agents is a difficult task, especially in open-domain chit-chat style dialogue.

Towards Holistic and Automatic Evaluation of Open-Domain Dialogue Generation

alexzhou907/dialogue_evaluation ACL 2020

Open-domain dialogue generation has gained increasing attention in Natural Language Processing.

Improving Dialog Evaluation with a Multi-reference Adversarial Dataset and Large Scale Pretraining

iitmnlp/Dialogue-Evaluation-with-BERT 23 Sep 2020

However, no such data is publicly available, and hence existing models are usually trained using a single relevant response and multiple randomly selected responses from other contexts (random negatives).
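A minimal sketch of the random-negatives setup described above, assuming the data is simply a list of (context, response) pairs; the sampling details are illustrative, not the paper's exact procedure.

```python
import random

def build_training_pairs(dialogues, num_negatives=5, seed=0):
    """Pair each context with its gold response (label 1) and with responses
    drawn at random from *other* contexts as random negatives (label 0).
    `dialogues` is assumed to be a list of (context, response) tuples."""
    rng = random.Random(seed)
    pairs = []
    for i, (context, response) in enumerate(dialogues):
        pairs.append((context, response, 1))          # single relevant response
        other = [r for j, (_, r) in enumerate(dialogues) if j != i]
        for neg in rng.sample(other, k=min(num_negatives, len(other))):
            pairs.append((context, neg, 0))           # random negatives
    return pairs
```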

GRADE: Automatic Graph-Enhanced Coherence Metric for Evaluating Open-Domain Dialogue Systems

li3cmz/GRADE EMNLP 2020

Capitalizing on the topic-level dialogue graph, we propose a new evaluation metric, GRADE, which stands for Graph-enhanced Representations for Automatic Dialogue Evaluation.
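A toy illustration of the topic-graph intuition (not GRADE's actual model): topic keywords from the context and the response are looked up in a small concept graph, and shorter hop distances are read as better topic coherence. The graph and keywords below are hand-made for illustration.

```python
import networkx as nx

# Hand-made toy concept graph; GRADE itself uses a much richer resource.
concept_graph = nx.Graph()
concept_graph.add_edges_from([
    ("coffee", "espresso"), ("espresso", "caffeine"), ("coffee", "cafe"),
    ("cafe", "croissant"), ("weather", "rain"),
])

def topic_coherence(context_keywords, response_keywords, graph):
    """Average shortest-path distance between topic keywords, mapped so that
    shorter hops yield a higher coherence score."""
    dists = []
    for c in context_keywords:
        for r in response_keywords:
            if nx.has_path(graph, c, r):
                dists.append(nx.shortest_path_length(graph, c, r))
    if not dists:
        return 0.0
    return 1.0 / (1.0 + sum(dists) / len(dists))

print(topic_coherence(["coffee"], ["espresso"], concept_graph))  # on-topic
print(topic_coherence(["coffee"], ["rain"], concept_graph))      # off-topic
```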

Deconstruct to Reconstruct a Configurable Evaluation Metric for Open-Domain Dialogue Systems

vitouphy/usl_dialogue_metric COLING 2020

For instance, specificity is mandatory in a food-ordering dialogue task, whereas fluency is preferred in a language-teaching dialogue system.
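A minimal sketch of such a configurable metric, assuming per-quality scores are already available; the sub-metric names and weights below are illustrative, not the paper's exact formulation.

```python
def configurable_score(sub_scores, weights):
    """Combine per-quality scores (e.g. understandability, specificity,
    fluency) with task-specific weights, normalized so the result stays
    in [0, 1] when the sub-scores do."""
    total = sum(weights.get(name, 0.0) for name in sub_scores)
    if total == 0:
        raise ValueError("at least one sub-metric must have a nonzero weight")
    return sum(s * weights.get(name, 0.0) for name, s in sub_scores.items()) / total

# Hypothetical configurations for the two examples mentioned above.
food_ordering = {"specificity": 0.6, "understandability": 0.3, "fluency": 0.1}
language_teaching = {"fluency": 0.6, "understandability": 0.3, "specificity": 0.1}

scores = {"specificity": 0.9, "understandability": 0.8, "fluency": 0.4}
print(configurable_score(scores, food_ordering))      # rewards specific responses
print(configurable_score(scores, language_teaching))  # rewards fluent responses
```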

An Adversarially-Learned Turing Test for Dialog Generation Models

golsun/AdversarialTuringTest 16 Apr 2021

To alleviate this risk, we propose an adversarial training approach to learn a robust model, ATT (Adversarial Turing Test), that discriminates machine-generated responses from human-written replies.
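A highly simplified sketch of the discriminator side of this idea: a classifier trained to label responses as human-written or machine-generated. The toy featurizer, model size, and data are assumptions; the actual ATT approach trains the discriminator adversarially against a generator.

```python
import torch
import torch.nn as nn

def featurize(text, dim=64):
    """Toy hashing bag-of-words featurizer (stand-in for a real encoder)."""
    v = torch.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    return v

# Invented examples: label 1 = human-written, label 0 = machine-generated.
human = ["that sounds wonderful, tell me more", "i had a long day at work"]
machine = ["i am a bot i am a bot", "yes yes yes okay okay"]

X = torch.stack([featurize(t) for t in human + machine])
y = torch.tensor([1.0] * len(human) + [0.0] * len(machine)).unsqueeze(1)

disc = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(disc.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(200):
    opt.zero_grad()
    loss = loss_fn(disc(X), y)
    loss.backward()
    opt.step()

# P(human-written) for a new reply; in ATT the generator would then be
# updated against this signal and the loop repeated.
print(torch.sigmoid(disc(featurize("that is an interesting question"))).item())
```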