Published on Mon Apr 06 2020

PONE: A Novel Automatic Evaluation Metric for Open-Domain Generative Dialogue Systems

Tian Lan, Xian-Ling Mao, Wei Wei, Xiaoyan Gao, Heyan Huang

Open-domain generative dialogue systems have attracted considerable attention over the past few years, yet evaluating them automatically remains a major challenge. There are three kinds of automatic evaluation methods: word-overlap-based, embedding-based, and learning-based metrics.

Abstract

Open-domain generative dialogue systems have attracted considerable attention over the past few years. However, evaluating them automatically remains a major challenge. As far as we know, there are three kinds of automatic methods to evaluate open-domain generative dialogue systems: (1) word-overlap-based metrics; (2) embedding-based metrics; (3) learning-based metrics. Due to the lack of systematic comparison, it is not clear which kind of metric is more effective. In this paper, we first systematically measure all kinds of automatic evaluation metrics under the same experimental setting to determine which kind is best. Extensive experiments demonstrate that learning-based metrics are the most effective evaluation metrics for open-domain generative dialogue systems. Moreover, we observe that nearly all learning-based metrics depend on a negative sampling mechanism, which yields an extremely imbalanced and low-quality dataset for training the score model. To address this issue, we propose PONE, a novel and feasible learning-based metric that significantly improves the correlation with human judgments by using augmented POsitive samples and valuable NEgative samples. Extensive experiments demonstrate that our proposed evaluation method significantly outperforms state-of-the-art learning-based evaluation methods, with an average correlation improvement of 13.18%. In addition, we have publicly released the code for our proposed method and the state-of-the-art baselines.
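To make the abstract's point about negative sampling concrete, the following is a minimal, hypothetical PyTorch-style sketch of a learning-based dialogue score model trained on (context, response, label) pairs. All names here (ScoreModel, random_negative_pairs, train_step) are illustrative assumptions and do not reproduce PONE's actual architecture or data construction; the sketch only shows where the training-pair construction step, which PONE improves, fits into such a pipeline.

```python
# Illustrative sketch only, not the paper's implementation.
import random
import torch
import torch.nn as nn

class ScoreModel(nn.Module):
    """Scores a (context, response) pair in [0, 1]; higher means more appropriate."""
    def __init__(self, vocab_size: int, dim: int = 128):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)  # simple bag-of-words encoder
        self.scorer = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1)
        )

    def forward(self, context_ids: torch.Tensor, response_ids: torch.Tensor) -> torch.Tensor:
        ctx, rsp = self.embed(context_ids), self.embed(response_ids)
        return torch.sigmoid(self.scorer(torch.cat([ctx, rsp], dim=-1))).squeeze(-1)

def random_negative_pairs(dialogues):
    """Plain negative sampling: the gold response is the only positive, and a
    randomly drawn response from another dialogue is the negative. This is the
    mechanism the abstract criticizes as imbalanced and low quality."""
    pairs = []
    for ctx, gold in dialogues:
        pairs.append((ctx, gold, 1.0))
        _, rand_rsp = random.choice(dialogues)
        pairs.append((ctx, rand_rsp, 0.0))
    return pairs

def train_step(model, optimizer, context_ids, response_ids, labels):
    """One binary cross-entropy update on a batch of (context, response, label) pairs."""
    optimizer.zero_grad()
    loss = nn.functional.binary_cross_entropy(model(context_ids, response_ids), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

A PONE-style pipeline would keep the score model but replace random_negative_pairs with a construction step that adds augmented positive responses and selects more informative negatives before training, which is where the reported correlation gains come from.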

Wed Feb 12 2020
Machine Learning
Learning to Compare for Better Training and Evaluation of Open Domain Natural Language Generation Models
Automated evaluation of open domain natural language generation (NLG) models remains a challenge. We propose to evaluate natural language generation models by fine-tuning BERT to learn to compare pairs of generated sentences. We evaluate our approach on both story generation and chit-chat response generation.
Thu Jun 29 2017
NLP
Relevance of Unsupervised Metrics in Task-Oriented Dialogue for Evaluating Natural Language Generation
Automated metrics such as BLEU are widely used in the machine translation literature. They have also been used recently in the dialogue community for evaluating dialogue response generation. Previous work has shown that these metrics do not correlate strongly with human judgment in the non-task-oriented dialogue setting.
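The kind of correlation study referenced above can be sketched as follows: compute a word-overlap score per response and measure its rank correlation with human ratings. This is a hypothetical illustration, not the protocol of either paper; the data and function name (bleu_human_correlation) are invented for the example.

```python
# Illustrative sketch of measuring how well BLEU tracks human judgments.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import spearmanr

def bleu_human_correlation(references, hypotheses, human_scores):
    """references/hypotheses: lists of token lists; human_scores: one float per hypothesis."""
    smooth = SmoothingFunction().method1
    bleu_scores = [
        sentence_bleu([ref], hyp, smoothing_function=smooth)
        for ref, hyp in zip(references, hypotheses)
    ]
    rho, p_value = spearmanr(bleu_scores, human_scores)
    return rho, p_value

# Toy usage with invented data:
refs = [["i", "am", "fine", "thanks"], ["see", "you", "tomorrow"]]
hyps = [["i", "am", "fine"], ["good", "night"]]
human = [4.0, 1.5]
print(bleu_human_correlation(refs, hyps, human))
```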
Mon Sep 28 2020
Artificial Intelligence
DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue
DialoGLUE is a public benchmark consisting of 7 task-oriented dialogue datasets. It is designed to encourage dialogue research in representation-based transfer, domain adaptation, and sample-efficient task learning.
Thu Jun 10 2021
NLP
AUGNLG: Few-shot Natural Language Generation using Self-trained Data Augmentation
AUGNLG combines a self-trained neural retrieval model with a few-shot NLU model. The proposed system mostly outperforms the state-of-the-art methods on FewShotWOZ data in both BLEU and Slot Error Rate.
Sun May 30 2021
NLP
REAM: An Enhancement Approach to Reference-based Evaluation Metrics for Open-domain Dialog Generation
Reference-based metrics have been proposed to calculate a score between a predicted response and a small set of references. However, these metrics correlate poorly with human judgments. We show how a prediction model can help augment the reference set and thus improve the reliability of the metric.
Tue Sep 07 2021
NLP
Naturalness Evaluation of Natural Language Generation in Task-oriented Dialogues using BERT
This paper presents an automatic method to evaluate the naturalness of natural language generation in dialogue systems. By fine-tuning the BERT model, our proposed naturalness evaluation method shows robust results.