Large end-to-end neural open-domain chatbots are becoming increasingly popular.
Automatic speech recognition (ASR) for telephone calls is essential to a variety of applications, including AI contact center (AICC) services.
Open-domain dialog generation is a challenging problem: maximum likelihood training can lead to repetitive outputs, models struggle to track long-term conversational goals, and training on standard movie or online corpora can yield inappropriate, biased, or offensive text.
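The sketch below is a minimal, illustrative example of the standard maximum likelihood objective referred to above, not any specific paper's training recipe; a Hugging Face-style causal LM is assumed, and `model`, `tokenizer`, `optimizer`, `context`, and `response` are hypothetical names.

```python
# Minimal sketch: next-token maximum likelihood training of a response
# generator on a (context, response) pair. All names are placeholders.
import torch
import torch.nn.functional as F

def mle_step(model, tokenizer, context, response, optimizer):
    # Concatenate context and response; the model is trained to predict
    # each token given everything that precedes it.
    ids = tokenizer(context + tokenizer.eos_token + response,
                    return_tensors="pt").input_ids
    logits = model(ids).logits
    # Shift so that position t predicts token t+1 (standard next-token loss).
    loss = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                           ids[:, 1:].reshape(-1))
    # The loss only rewards matching the logged reference, so bland,
    # high-frequency replies score well on average, which is one source
    # of the repetitive outputs noted above.
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```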
Most deep reinforcement learning (RL) systems are not able to learn effectively from off-policy data, especially if they cannot explore online in the environment.
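To make the off-policy difficulty concrete, here is an illustrative (not prescriptive) sketch of a per-trajectory importance-weighted return estimate, one standard way to reuse data logged by another policy; `pi_theta`, `pi_behavior`, and `trajectory` are hypothetical.

```python
# Illustration only: estimate the learned policy's return from off-policy data.
def importance_weighted_return(trajectory, pi_theta, pi_behavior, gamma=0.99):
    """trajectory: list of (state, action, reward) tuples logged off-policy.
    pi_theta(s, a) / pi_behavior(s, a): action probabilities under the
    learned policy and the data-collecting policy, respectively."""
    weight, ret = 1.0, 0.0
    for t, (s, a, r) in enumerate(trajectory):
        # The cumulative ratio corrects for the policy mismatch; it grows or
        # collapses multiplicatively, which is why purely off-policy estimates
        # become high-variance without the ability to explore online.
        weight *= pi_theta(s, a) / pi_behavior(s, a)
        ret += weight * (gamma ** t) * r
    return ret
```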
To investigate the strengths of this novel metric and of interactive evaluation relative to state-of-the-art metrics and human evaluation of static conversations, we conduct extensive experiments with a set of models, including several that improve on recent hierarchical dialog generation architectures through utterance-level sentiment and semantic knowledge distillation.
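The exact distillation formulation is not given here; as a hedged illustration, one common form of utterance-level distillation adds a temperature-scaled KL term that pulls the dialog model's utterance-level prediction (e.g., a sentiment distribution) toward a pretrained teacher's soft labels. `student_logits` and `teacher_logits` are hypothetical tensors of shape (batch, num_classes).

```python
# Generic knowledge distillation loss, shown for intuition only.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    t = temperature
    # Soften both distributions with a temperature, then match them with KL.
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # Scale by t**2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * (t ** 2)
```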
The lack of meaningful automatic evaluation metrics for dialog has impeded open-domain dialog research.
The aim of this paper is to mitigate the shortcomings of automatic evaluation of open-domain dialog systems through multi-reference evaluation.
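As a minimal sketch of what multi-reference evaluation looks like in practice, the example below scores a generated response against several human references for the same context, so a reasonable reply is not penalized merely for differing from the single logged response. It uses NLTK's BLEU implementation; the references and hypothesis are invented for illustration.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "i love hiking on weekends".split(),
    "hiking is my favorite weekend activity".split(),
    "i usually go hiking when i am free".split(),
]
hypothesis = "i really enjoy hiking on the weekend".split()

# With multiple references, each n-gram count is clipped against the maximum
# count found in any reference, so overlap with any one human reply is credited.
score = sentence_bleu(references, hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f"multi-reference BLEU: {score:.3f}")
```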
The predominant approach to open-domain dialog generation relies on end-to-end training of neural models on chat datasets.
It is important to define meaningful and interpretable automatic evaluation metrics for open-domain dialog research.
Large-scale pretrained language models define the state of the art in natural language processing, achieving outstanding performance on a variety of tasks.
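As a hedged example of the pattern this refers to, the snippet below simply loads a pretrained GPT-2 via Hugging Face Transformers and samples a continuation; fine-tuning for a downstream task would start from the same pretrained weights. The prompt text is arbitrary.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Open-domain dialog systems should", return_tensors="pt")
# Sample a short continuation to show the pretrained model is usable as-is.
output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```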