Dialogue Evaluation
48 papers with code • 2 benchmarks • 6 datasets
Most implemented papers
Evaluating Coherence in Dialogue Systems using Entailment
Evaluating open-domain dialogue systems is difficult due to the diversity of possible correct answers.
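The core idea can be sketched with an off-the-shelf NLI model: treat the dialogue history as the premise and the candidate response as the hypothesis, and use the entailment probability as a coherence signal. A minimal sketch, assuming the Hugging Face transformers library with roberta-large-mnli as a stand-in for the paper's own entailment model:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Stand-in NLI checkpoint; the paper trains its own entailment model,
# which this off-the-shelf model only approximates.
tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def coherence_score(context: str, response: str) -> float:
    """Entailment probability of the response given the dialogue context."""
    inputs = tok(context, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = logits.softmax(dim=-1).squeeze()
    # roberta-large-mnli label order: 0 contradiction, 1 neutral, 2 entailment.
    return probs[2].item()

print(coherence_score("I just got back from a hiking trip in the Alps.",
                      "Nice! Which trails did you do?"))
```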
Towards Best Experiment Design for Evaluating Dialogue System Output
To overcome the limitations of automated metrics (e.g., BLEU, METEOR) for evaluating dialogue systems, researchers typically use human judgments to provide convergent evidence.
PONE: A Novel Automatic Evaluation Metric for Open-Domain Generative Dialogue Systems
Through extensive experiments, learning-based metrics are demonstrated to be the most effective evaluation metrics for open-domain generative dialogue systems.
USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation
The lack of meaningful automatic evaluation metrics for dialog has impeded open-domain dialog research.
Learning an Unreferenced Metric for Online Dialogue Evaluation
Evaluating the quality of a dialogue interaction between two agents is a difficult task, especially in open-domain chit-chat style dialogue.
Towards Holistic and Automatic Evaluation of Open-Domain Dialogue Generation
Open-domain dialogue generation has gained increasing attention in Natural Language Processing.
Improving Dialog Evaluation with a Multi-reference Adversarial Dataset and Large Scale Pretraining
However, no such data is publicly available, and hence existing models are usually trained using a single relevant response and multiple randomly selected responses from other contexts (random negatives).
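The random-negative construction described above is straightforward to sketch; the data layout and names below are illustrative, assuming dialogues given as (context, gold_response) pairs:

```python
import random

def build_training_pairs(dialogues, num_negatives=5, seed=0):
    """Pair each context with its gold response (label 1) and with
    responses sampled from other contexts as random negatives (label 0).
    Assumes more than one dialogue in the list."""
    rng = random.Random(seed)
    responses = [resp for _, resp in dialogues]
    pairs = []
    for i, (context, gold) in enumerate(dialogues):
        pairs.append((context, gold, 1))
        for _ in range(num_negatives):
            j = rng.randrange(len(dialogues))
            while j == i:  # skip the context's own gold response
                j = rng.randrange(len(dialogues))
            pairs.append((context, responses[j], 0))
    return pairs
```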
GRADE: Automatic Graph-Enhanced Coherence Metric for Evaluating Open-Domain Dialogue Systems
Capitalizing on the topic-level dialogue graph, we propose a new evaluation metric, GRADE, which stands for Graph-enhanced Representations for Automatic Dialogue Evaluation.
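As a toy illustration of the topic-graph intuition only (not the actual GRADE model, which learns graph-enhanced representations): link topics in the context to topics in the response with similarity-weighted edges and score coherence by the strength of the best alignments.

```python
def topic_graph_coherence(context_topics, response_topics, similarity):
    """Toy sketch: for each response topic, take its strongest
    similarity-weighted edge into the context topics, then average."""
    if not context_topics or not response_topics:
        return 0.0
    best = [max(similarity(c, r) for c in context_topics)
            for r in response_topics]
    return sum(best) / len(best)

# `similarity` would normally be embedding cosine similarity;
# a trivial character-overlap stand-in keeps the sketch self-contained.
overlap = lambda a, b: len(set(a) & set(b)) / len(set(a) | set(b))
print(topic_graph_coherence(["travel", "mountains"], ["hiking"], overlap))
```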
Deconstruct to Reconstruct a Configurable Evaluation Metric for Open-Domain Dialogue Systems
For instance, specificity is mandatory in a food-ordering dialogue task, whereas fluency is preferred in a language-teaching dialogue system.
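One natural reading of "configurable" is a weighted combination of per-quality sub-metrics, with the weights set per task. A minimal sketch, where the sub-metric functions and weights are hypothetical:

```python
def configurable_score(context, response, submetrics, weights):
    """Combine per-quality sub-metrics (e.g. fluency, specificity)
    with task-specific weights. `submetrics` maps quality name to a
    scoring function; `weights` maps quality name to its weight."""
    total = sum(weights.values())
    return sum(weights[name] * fn(context, response)
               for name, fn in submetrics.items()) / total

# A food-ordering task might weight specificity heavily, while a
# language-teaching system weights fluency (weights are illustrative):
# score = configurable_score(ctx, resp,
#     {"fluency": fluency_fn, "specificity": specificity_fn},
#     {"fluency": 0.2, "specificity": 0.8})
```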
An Adversarially-Learned Turing Test for Dialog Generation Models
To alleviate this risk, we propose an adversarial training approach to learn a robust model, ATT (Adversarial Turing Test), that discriminates machine-generated responses from human-written replies.
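The discriminator half of such a setup can be sketched in PyTorch; the architecture, embeddings, and training step below are simplified stand-ins, not the paper's ATT model:

```python
import torch
import torch.nn as nn

class ResponseDiscriminator(nn.Module):
    """Toy discriminator: scores whether a (context, response)
    embedding pair looks human-written (1) or machine-generated (0)."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, context_emb, response_emb):
        return self.net(torch.cat([context_emb, response_emb], dim=-1))

# One discriminator update (sketch): separate human replies from
# generator samples; the generator would then be updated to fool it.
disc = ResponseDiscriminator()
opt = torch.optim.Adam(disc.parameters(), lr=1e-4)
loss_fn = nn.BCEWithLogitsLoss()

ctx = torch.randn(8, 256)    # stand-in context embeddings
human = torch.randn(8, 256)  # stand-in human-reply embeddings
fake = torch.randn(8, 256)   # stand-in generator samples

logits = torch.cat([disc(ctx, human), disc(ctx, fake)])
labels = torch.cat([torch.ones(8, 1), torch.zeros(8, 1)])
loss = loss_fn(logits, labels)
opt.zero_grad(); loss.backward(); opt.step()
```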