Published on Thu May 13 2021

Towards Human-Free Automatic Quality Evaluation of German Summarization

Neslihan Iskender, Oleg Vasilyev, Tim Polzehl, John Bohannon, Sebastian Möller

Evaluating large summarization corpora using humans has proven to be expensive from both the organizational and the financial perspective. Since BLANC does not require golden summaries and can use any underlying language model, we consider its application to the evaluation of summarization in German.

Abstract

Evaluating large summarization corpora using humans has proven to be expensive from both the organizational and the financial perspective. Therefore, many automatic evaluation metrics have been developed to measure the summarization quality in a fast and reproducible way. However, most of the metrics still rely on humans and need gold standard summaries generated by linguistic experts. Since BLANC does not require golden summaries and supposedly can use any underlying language model, we consider its application to the evaluation of summarization in German. This work demonstrates how to adjust the BLANC metric to a language other than English. We compare BLANC scores with the crowd and expert ratings, as well as with commonly used automatic metrics on a German summarization data set. Our results show that BLANC in German is especially good in evaluating informativeness.
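
To make the adaptation concrete, the sketch below illustrates the BLANC-help idea on German text: a summary is credited for how much it improves a masked language model's reconstruction of document tokens compared to an uninformative filler. This is a simplified illustration rather than the authors' implementation (an open-source blanc package exists); the bert-base-german-cased checkpoint, the masking scheme, and the filler construction are assumptions made for the example.

```python
# Simplified sketch of the BLANC-help idea applied to German (an illustration,
# not the authors' implementation): a summary is scored by how much it helps a
# masked language model reconstruct masked tokens of the document, relative to
# an uninformative filler of similar length.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL_NAME = "bert-base-german-cased"  # assumed German BERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
model.eval()


def reconstruction_counts(prefix, sentence, every=4):
    """Mask every `every`-th token of `sentence` (in `every` passes so each
    token is masked once), prepend `prefix`, and count correct predictions."""
    sent_ids = tokenizer(sentence, add_special_tokens=False)["input_ids"]
    prefix_ids = tokenizer(prefix, add_special_tokens=False)["input_ids"]
    offset = 2 + len(prefix_ids)  # [CLS] + prefix + [SEP] precede the sentence
    correct, total = 0, 0
    for start in range(every):
        positions = list(range(start, len(sent_ids), every))
        if not positions:
            continue
        masked = list(sent_ids)
        for p in positions:
            masked[p] = tokenizer.mask_token_id
        input_ids = ([tokenizer.cls_token_id] + prefix_ids +
                     [tokenizer.sep_token_id] + masked + [tokenizer.sep_token_id])
        with torch.no_grad():
            logits = model(torch.tensor([input_ids])).logits[0]
        for p in positions:
            correct += int(logits[offset + p].argmax().item() == sent_ids[p])
            total += 1
    return correct, total


def blanc_help(document_sentences, summary):
    """Relative gain in masked-token accuracy from seeing the summary.
    Sentences are assumed short enough to fit the model's 512-token limit."""
    filler = ". " * len(summary.split())  # rough same-length filler (assumption)
    helped, unhelped, total = 0, 0, 0
    for sent in document_sentences:
        c_sum, n = reconstruction_counts(summary, sent)
        c_fil, _ = reconstruction_counts(filler, sent)
        helped, unhelped, total = helped + c_sum, unhelped + c_fil, total + n
    return (helped - unhelped) / max(total, 1)
```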

Wed Jun 02 2021
NLP
Evaluating the Efficacy of Summarization Evaluation across Languages
This is the first attempt to systematically quantify the panlinguistic efficacy of automatic summarization evaluation metrics. We take a summarization corpus for eight different languages and manually annotate generated summaries. We find that using multilingual BERT within BERTScore performs well across all languages.
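
As a usage sketch of that setup (assuming the open-source bert-score package; the German example sentences are invented), BERTScore with a multilingual encoder can be invoked as follows:

```python
# Usage sketch (assumes the open-source `bert-score` package): scoring a German
# candidate summary against a reference with BERTScore. For non-English
# languages the package typically falls back to multilingual BERT.
from bert_score import score

candidates = ["Die Studie untersucht automatische Bewertungsmetriken."]  # made-up example
references = ["Der Artikel untersucht Metriken zur automatischen Bewertung."]

P, R, F1 = score(candidates, references, lang="de")
print(f"BERTScore F1: {F1.mean().item():.3f}")
```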
Wed Oct 14 2020
Machine Learning
Re-evaluating Evaluation in Text Summarization
Automated evaluation metrics as a stand-in for manual evaluation are an essential part of the development of text-generation tasks. For nearly 20 years ROUGE has been the standard evaluation in most summarization papers.
Fri Apr 01 2016
NLP
Revisiting Summarization Evaluation for Scientific Articles
The most widely used metric in summarization evaluation has been the ROUGE family. We show that, contrary to common belief, ROUGE is not very reliable for evaluating scientific summaries. We propose an alternative metric for summarization evaluation that is based on content relevance.
Wed Jan 27 2021
NLP
How to Evaluate a Summarizer: Study Design and Statistical Analysis for Manual Linguistic Quality Evaluation
Manual evaluation is essential to judge progress on automatic text summarization. A survey on recent summarization system papers reveals little agreement on how to perform such evaluation studies.
Sun Nov 08 2020
Artificial Intelligence
Metrics also Disagree in the Low Scoring Range: Revisiting Summarization Evaluation Metrics
In text summarization, evaluating the efficacy of automatic metrics without human judgments has become popular. One exemplar work concludes that automatic metrics strongly disagree when ranking high-scoring summaries. We hypothesize that this may be because summaries are similar to each other in a narrow scoring range and are thus difficult to rank.
Sun Feb 23 2020
NLP
Fill in the BLANC: Human-free quality estimation of document summaries
Sat Oct 24 2020
NLP
GO FIGURE: A Meta Evaluation of Factuality in Summarization
Neural language models can generate text with remarkable fluency and coherence. However, controlling for factual correctness in generation remains an open research question. In this paper, we introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Thu May 07 2020
NLP
SUPERT: Towards New Frontiers in Unsupervised Evaluation Metrics for Multi-Document Summarization
We study unsupervised multi-document summarization evaluation metrics, which require neither human-written reference summaries nor human annotations. We propose SUPERT, which rates the quality of a summary by measuring its semantic similarity with a pseudo reference summary. We use SUPERT as rewards to guide a neural-based reinforcement learning summarizer.
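
A rough approximation of that idea, not the SUPERT reference implementation, could compare the summary to a pseudo reference built from the leading sentences of each source document; the sentence-transformers package and the multilingual checkpoint below are assumptions, and a single cosine similarity stands in for SUPERT's soft token alignment:

```python
# Rough approximation of the SUPERT idea (not the reference implementation):
# build a pseudo reference from leading source sentences and rate the summary
# by embedding similarity. Assumes the `sentence-transformers` package and a
# multilingual checkpoint.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # assumed checkpoint


def supert_like_score(source_docs, summary, leading_sentences=3):
    # Pseudo reference: the first few sentences of each source document.
    pseudo_ref = []
    for doc in source_docs:
        pseudo_ref.extend(doc.split(". ")[:leading_sentences])  # crude sentence split
    ref_emb = model.encode(" ".join(pseudo_ref))
    sum_emb = model.encode(summary)
    # Cosine similarity stands in for SUPERT's soft token alignment.
    return float(np.dot(ref_emb, sum_emb) /
                 (np.linalg.norm(ref_emb) * np.linalg.norm(sum_emb)))
```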
Tue Aug 25 2015
NLP
Better Summarization Evaluation with Word Embeddings for ROUGE
ROUGE is a widely adopted, automatic evaluation measure for text summaries. While it has been shown to correlate well with human judgements, it is biased towards surface lexical similarities. We study the effectiveness of word embeddings to overcome this disadvantage.
Thu May 07 2020
NLP
FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization
Neural abstractive summarization models are prone to generate content inconsistent with the source document. Existing automatic metrics do not capture such mistakes effectively. We propose an automatic question answering (QA) based metric for faithfulness, FEQA.
Thu Apr 11 2019
NLP
Crowdsourcing Lightweight Pyramids for Manual Summary Evaluation
The Pyramid protocol exhaustively compares system summaries to references. It has been perceived as very reliable, providing objective scores. We revisit the Pyramid approach, proposing a lightweight sampling-based version that is crowdsourcable.