USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation

ACL 2020  ·  Shikib Mehri, Maxine Eskenazi

The lack of meaningful automatic evaluation metrics for dialog has impeded open-domain dialog research, and standard language generation metrics have been shown to be ineffective for evaluating dialog models. To this end, this paper presents USR, an UnSupervised and Reference-free evaluation metric for dialog. USR trains unsupervised models to measure several desirable qualities of dialog, without requiring reference responses. It is shown to correlate strongly with human judgment on both Topical-Chat (turn-level: 0.42, system-level: 1.0) and PersonaChat (turn-level: 0.48, system-level: 1.0), and it additionally produces interpretable measures for several desirable properties of dialog.
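USR combines the scores of its unsupervised sub-metrics (e.g., a masked-language-model score and dialog-retrieval scores) into a single overall quality score; the paper does this with a learned regression. A minimal weighted-average stand-in for that combination step (the weights and score values below are purely illustrative, not the paper's learned coefficients):

```python
def usr_score(sub_scores, weights):
    """Combine per-quality sub-metric scores into one overall score.

    A hedged sketch: the actual USR metric learns a regression over its
    sub-metrics; here we use a simple weighted average as a stand-in.
    """
    assert len(sub_scores) == len(weights)
    return sum(s * w for s, w in zip(sub_scores, weights)) / sum(weights)

# Hypothetical sub-metric scores: MLM, DR (x = c), DR (x = f)
overall = usr_score([0.8, 0.6, 0.7], [1.0, 1.0, 1.0])
print(overall)
```

Because the sub-metric scores are kept separate before combination, the metric stays interpretable: each sub-score can be inspected on its own.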


Datasets


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Dialogue Evaluation | USR-PersonaChat | USR - DR (x = c) | Spearman Correlation | 0.4814 | # 2 |
| Dialogue Evaluation | USR-PersonaChat | USR - DR (x = c) | Pearson Correlation | 0.6087 | # 1 |
| Dialogue Evaluation | USR-PersonaChat | USR - DR (x = f) | Spearman Correlation | -0.0495 | # 5 |
| Dialogue Evaluation | USR-PersonaChat | USR - DR (x = f) | Pearson Correlation | -0.0454 | # 5 |
| Dialogue Evaluation | USR-PersonaChat | USR - MLM | Spearman Correlation | 0.0795 | # 4 |
| Dialogue Evaluation | USR-PersonaChat | USR - MLM | Pearson Correlation | 0.0788 | # 4 |
| Dialogue Evaluation | USR-PersonaChat | USR | Spearman Correlation | 0.4693 | # 3 |
| Dialogue Evaluation | USR-PersonaChat | USR | Pearson Correlation | 0.4115 | # 3 |
| Dialogue Evaluation | USR-TopicalChat | USR - DR (x = f) | Spearman Correlation | 0.1419 | # 6 |
| Dialogue Evaluation | USR-TopicalChat | USR - DR (x = f) | Pearson Correlation | 0.3221 | # 6 |
| Dialogue Evaluation | USR-TopicalChat | USR - DR (x = c) | Spearman Correlation | 0.3245 | # 4 |
| Dialogue Evaluation | USR-TopicalChat | USR - DR (x = c) | Pearson Correlation | 0.4068 | # 4 |
| Dialogue Evaluation | USR-TopicalChat | USR - MLM | Spearman Correlation | 0.3086 | # 5 |
| Dialogue Evaluation | USR-TopicalChat | USR - MLM | Pearson Correlation | 0.3345 | # 5 |
| Dialogue Evaluation | USR-TopicalChat | USR | Spearman Correlation | 0.4192 | # 3 |
| Dialogue Evaluation | USR-TopicalChat | USR | Pearson Correlation | 0.4220 | # 3 |
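The Spearman and Pearson correlations reported above compare metric scores against human quality judgments over the same set of responses. A self-contained sketch of how such correlations are computed (pure Python; the score lists in the usage example are hypothetical, not the paper's data):

```python
def pearson(x, y):
    """Pearson correlation: normalized covariance of two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def rank(v):
    """Ranks of the values in v (1-based), averaging ranks over ties."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tied block
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(rank(x), rank(y))

# Hypothetical per-response scores: metric vs. human judgment
metric_scores = [0.2, 0.9, 0.5, 0.7]
human_scores = [1.0, 4.5, 2.0, 4.0]
print(spearman(metric_scores, human_scores), pearson(metric_scores, human_scores))
```

Spearman only depends on the ordering of responses, which is why a metric can reach a system-level correlation of 1.0 by ranking systems correctly even when its raw scores differ from the human scale.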
