Towards Standard Criteria for human evaluation of Chatbots: A Survey
Human evaluation is becoming a necessity to test the performance of Chatbots. However, off-the-shelf settings suffer the severe reliability and replication issues partly because of the extremely high diversity of criteria. It is high time to come up with standard criteria and exact definitions. To this end, we conduct a through investigation of 105 papers involving human evaluation for Chatbots. Deriving from this, we propose five standard criteria along with precise definitions.
PDF AbstractTasks
Datasets
Add Datasets
introduced or used in this paper
Results from the Paper
Submit
results from this paper
to get state-of-the-art GitHub badges and help the
community compare results to other papers.
Methods
No methods listed for this paper. Add
relevant methods here