Published on Fri Oct 09 2020

Mark-Evaluate: Assessing Language Generation using Population Estimation Methods

Gonçalo Mordido, Christoph Meinel

We adapt mark-recapture and maximum-likelihood methods, originally developed to estimate the size of closed populations in the wild, to assess language generation. In synthetic experiments, our family of methods is sensitive to drops in quality and diversity, and it shows a higher correlation to human evaluation than existing metrics.

Abstract

We propose a family of metrics to assess language generation derived from population estimation methods widely used in ecology. More specifically, we use mark-recapture and maximum-likelihood methods that have been applied over the past several decades to estimate the size of closed populations in the wild. We propose three novel metrics: ME-Petersen and ME-CAPTURE, which retrieve a single-valued assessment, and ME-Schnabel, which returns a double-valued metric that assesses the evaluation set in terms of quality and diversity separately. In synthetic experiments, our family of methods is sensitive to drops in quality and diversity. Moreover, our methods show a higher correlation to human evaluation than existing metrics on several challenging tasks, namely unconditional language generation, machine translation, and text summarization.
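As an illustration of the ecology side of the method, the classic Lincoln-Petersen estimator that underlies ME-Petersen infers a closed population size from a mark-and-recapture experiment: mark M individuals, later capture C individuals, count the R recaptures, and estimate the population as N = (M * C) / R. The sketch below shows only this generic estimator, not the authors' full metric, which additionally defines marking and recapture over samples from the real and generated sets (e.g. via nearest-neighbour matching in an embedding space).

```python
def petersen_estimate(n_marked: int, n_captured: int, n_recaptured: int) -> float:
    """Lincoln-Petersen closed-population estimate: N = (M * C) / R.

    n_marked      -- number of individuals marked in the first capture (M)
    n_captured    -- number of individuals drawn in the second capture (C)
    n_recaptured  -- number of second-capture individuals that were marked (R)
    """
    if n_recaptured == 0:
        raise ValueError("No recaptures: the Petersen estimate is undefined.")
    return (n_marked * n_captured) / n_recaptured


if __name__ == "__main__":
    # Toy example: 50 samples marked, 40 captured later, 10 of them recaptured.
    print(petersen_estimate(n_marked=50, n_captured=40, n_recaptured=10))  # 200.0
```

In the paper's setting, the "population size" recovered this way is compared against the known number of samples, so deviations signal quality or diversity problems in the generated set; the exact mapping from samples to marks is described in the paper, not in this sketch.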

Tue Mar 19 2019
NLP
compare-mt: A Tool for Holistic Comparison of Language Generation Systems
compare-mt is a tool for holistic analysis of systems for language generation tasks. The main goal of the tool is to give the user a high-level and coherent view of the salient differences between systems. The code is available on GitHub.
Wed Apr 22 2020
NLP
Trading Off Diversity and Quality in Natural Language Generation
For open-ended language generation tasks such as storytelling and dialogue, choosing the right decoding algorithm is critical to controlling the tradeoff between generation quality and diversity. There presently exists no consensus on which decoding procedure is best or even the criteria by which to compare them.
Mon Oct 26 2020
NLP
Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale
Automatic metrics are the de facto standard for evaluating tasks such as image captioning and machine translation. This is partly due to ease of use, and partly because researchers expect to see them. We urge the community to consider more carefully how they automatically evaluate their models.
Mon May 10 2021
NLP
Societal Biases in Language Generation: Progress and Challenges
Language generation presents unique challenges in terms of direct user interaction and the structure of decoding techniques. We present a survey on societal biases in language generation, focusing on how generation techniques contribute to biases, and conduct experiments to quantify their effects.
Mon Apr 06 2020
NLP
Evaluating the Evaluation of Diversity in Natural Language Generation
There is currently no principled method for evaluating the diversity of an NLG system. We propose a framework that measures the correlation between a proposed diversity metric and a diversity parameter. We demonstrate the utility of our framework by establishing best practices for eliciting diversity judgments.
Tue Sep 03 2019
NLP
The Woman Worked as a Babysitter: On Biases in Language Generation
We present a systematic study of biases in natural language generation (NLG) by analyzing text generated from prompts that contain mentions of different demographic groups. We use the varying levels of regard towards different demographics as a defining metric for bias in NLG.