NLG Evaluation

23 papers with code • 0 benchmarks • 0 datasets

Evaluating text generated by NLG (Natural Language Generation) systems, such as large language models.

Most implemented papers

NLG Evaluation Metrics Beyond Correlation Analysis: An Empirical Metric Preference Checklist

inimah/metric-preference-checklist 15 May 2023

Our proposed framework provides a means (i) to verify whether automatic metrics are faithful to human preference, regardless of their correlation level with human judgements, and (ii) to inspect the strengths and limitations of NLG systems via pairwise evaluation.
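
The core check behind such pairwise evaluation can be expressed very compactly: count how often the metric prefers the same output a human preferred. Below is a minimal, self-contained sketch of that idea; the scores and preference labels are illustrative placeholders, not data or code from inimah/metric-preference-checklist.

```python
# Hypothetical pairwise data: (metric score for output A, metric score for output B,
# human-preferred output). All values are made up for illustration.
pairs = [
    (0.72, 0.65, "A"),
    (0.40, 0.55, "B"),
    (0.81, 0.79, "B"),
]

def metric_prefers(score_a: float, score_b: float) -> str:
    """Which output the automatic metric ranks higher."""
    return "A" if score_a > score_b else "B"

# Fraction of pairs where the metric agrees with the human preference.
agreement = sum(metric_prefers(a, b) == human for a, b, human in pairs) / len(pairs)
print(f"pairwise agreement with human preference: {agreement:.2f}")
```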

Towards a Unified Multi-Dimensional Evaluator for Text Generation

maszhongming/unieval 13 Oct 2022

We re-frame NLG evaluation as a Boolean Question Answering (QA) task, and by guiding the model with different questions, we can use one evaluator to evaluate along multiple dimensions.
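
As a rough illustration of the Boolean QA idea (not the UniEval API from maszhongming/unieval), one can phrase each evaluation dimension as a yes/no question and read off the model's probability of answering "Yes". The sketch below assumes a generic T5-style model; the checkpoint name and question wording are placeholders.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder checkpoint; any T5-style seq2seq model that answers yes/no questions works similarly.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

def boolean_qa_score(question: str, context: str) -> float:
    """Probability that the model answers "Yes" to a yes/no evaluation question."""
    prompt = f"question: {question} context: {context}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    yes_id = tokenizer("Yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer("No", add_special_tokens=False).input_ids[0]
    decoder_start = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_start).logits[0, -1]
    yes_prob, _ = torch.softmax(logits[[yes_id, no_id]], dim=-1)
    return yes_prob.item()

# One illustrative yes/no question per evaluation dimension.
questions = {
    "coherence": "Is this summary coherent and well structured?",
    "fluency": "Is this summary fluent and grammatical?",
}
summary = "the movie was praised by critics it opened last friday"
print({dim: round(boolean_qa_score(q, summary), 3) for dim, q in questions.items()})
```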

G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

nlpyang/geval 29 Mar 2023

In this work, we present G-Eval, a framework that uses large language models with chain-of-thought (CoT) prompting and a form-filling paradigm to assess the quality of NLG outputs.
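
A hedged sketch of this kind of LLM-as-judge setup is shown below, using the OpenAI Python client. The prompt wording, model name, and score extraction are illustrative stand-ins, not the official prompts or scoring from nlpyang/geval; G-Eval weights scores by token probabilities, and averaging several sampled judgments is used here only as a rough substitute.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative prompt: an evaluation criterion, CoT-style evaluation steps, and one score field to fill.
PROMPT = """You will be given a source document and a summary.

Evaluation criterion: Coherence (1-5) - the summary should be well structured and logically ordered.

Evaluation steps:
1. Read the source document and identify its main points.
2. Check whether the summary presents those points in a clear, logical order.
3. Assign a coherence score from 1 to 5.

Source: {source}
Summary: {summary}

Coherence score (1-5):"""

def llm_judge_score(source: str, summary: str, model: str = "gpt-4o") -> float:
    """Sample several judgments and average them (a rough stand-in for
    G-Eval's probability-weighted scoring)."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(source=source, summary=summary)}],
        temperature=1.0,
        n=5,
    )
    scores = []
    for choice in resp.choices:
        digits = [ch for ch in choice.message.content if ch.isdigit()]
        if digits:
            scores.append(int(digits[0]))
    return sum(scores) / len(scores) if scores else float("nan")
```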

Not All Metrics Are Guilty: Improving NLG Evaluation by Diversifying References

rucaibox/div-ref 24 May 2023

Most research on natural language generation (NLG) relies on evaluation benchmarks with a limited number of references per sample, which may result in poor correlations with human judgements.
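
The remedy proposed here is to expand each sample's reference set (e.g., with LLM-generated paraphrases) before scoring. The sketch below illustrates multi-reference scoring with sacrebleu; the hypothesis and the hand-written "paraphrases" are placeholders rather than Div-Ref outputs.

```python
from sacrebleu.metrics import BLEU

bleu = BLEU(effective_order=True)  # sentence-level BLEU

hypothesis = "The quick brown fox leaps over the lazy dog."
references = [
    "The quick brown fox jumps over the lazy dog.",   # original (single) reference
    "A fast brown fox leaps over a lazy dog.",        # hand-written stand-in for an LLM paraphrase
    "Over the lazy dog, the quick brown fox leaps.",  # another stand-in paraphrase
]

single_ref = bleu.sentence_score(hypothesis, references[:1]).score
multi_ref = bleu.sentence_score(hypothesis, references).score
print(f"single-reference BLEU: {single_ref:.1f}")
print(f"multi-reference  BLEU: {multi_ref:.1f}")
```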

Why We Need New Evaluation Metrics for NLG

jeknov/EMNLP_17_submission EMNLP 2017

The majority of NLG evaluation relies on automatic metrics, such as BLEU.

A Study of Automatic Metrics for the Evaluation of Natural Language Explanations

MirunaClinciu/ExBAN EACL 2021

As transparency becomes key for robotics and AI, it will be necessary to evaluate the methods through which transparency is provided, including automatically generated natural language (NL) explanations.

Perturbation CheckLists for Evaluating NLG Evaluation Metrics

iitmnlp/evaleval EMNLP 2021

Natural Language Generation (NLG) evaluation is a multifaceted task requiring assessment of multiple desirable criteria, e.g., fluency, coherence, coverage, relevance, adequacy, and overall quality.
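
The idea behind such checklists can be illustrated with a single perturbation: apply a targeted corruption (here, sentence shuffling, which should hurt coherence) and verify that a metric's score drops. The "metric" below is a deliberately naive placeholder, not one of the metrics or templates from iitmnlp/evaleval.

```python
import random

def shuffle_sentences(text: str) -> str:
    """Targeted perturbation: scramble sentence order, which should hurt coherence."""
    sents = [s.strip() for s in text.split(".") if s.strip()]
    random.shuffle(sents)
    return ". ".join(sents) + "."

def passes_coherence_check(metric, text: str) -> bool:
    """A coherence-sensitive metric should score the original above its shuffled version."""
    return metric(text) > metric(shuffle_sentences(text))

# A deliberately order-blind "metric" (word count) fails the check,
# which is exactly the kind of blind spot a perturbation checklist is meant to expose.
naive_metric = lambda t: len(t.split())
print(passes_coherence_check(naive_metric, "First point. Then a second point. Finally a third point."))
```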

Compression, Transduction, and Creation: A Unified Framework for Evaluating Natural Language Generation

tanyuqian/ctc-gen-eval EMNLP 2021

Based on the nature of information change from input to output, we classify NLG tasks into compression (e.g., summarization), transduction (e.g., text rewriting), and creation (e.g., dialog).

Active Evaluation: Efficient NLG Evaluation with Few Pairwise Comparisons

akashkm99/duelnlg ACL 2022

In this work, we introduce Active Evaluation, a framework to efficiently identify the top-ranked system by actively choosing system pairs for comparison using dueling bandit algorithms.
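
A toy version of the idea, with a simple uncertainty heuristic standing in for the paper's dueling bandit algorithms and a random simulator standing in for human annotators (not the akashkm99/duelnlg implementation), might look like the following; system names and quality values are made up for illustration.

```python
import random
from itertools import combinations

systems = ["sys_A", "sys_B", "sys_C"]
true_quality = {"sys_A": 0.7, "sys_B": 0.5, "sys_C": 0.3}  # hidden; used only to simulate annotators

wins = {pair: 0 for pair in combinations(systems, 2)}
plays = {pair: 0 for pair in wins}

def judge(i: str, j: str) -> bool:
    """Simulated annotator: i beats j with probability tied to the quality gap."""
    return random.random() < 0.5 + 0.5 * (true_quality[i] - true_quality[j])

def selection_score(pair):
    """Prefer unexplored pairs, then pairs whose outcome is still most uncertain."""
    n = plays[pair]
    if n == 0:
        return (0, 0.0)
    return (1, abs(wins[pair] / n - 0.5) - 1.0 / (n + 1))

for _ in range(200):
    pair = min(plays, key=selection_score)  # actively choose the next comparison
    i, j = pair
    wins[pair] += judge(i, j)
    plays[pair] += 1

# Copeland-style aggregation: average win rate of each system across its pairings.
win_rate = {s: 0.0 for s in systems}
for (i, j), n in plays.items():
    if n:
        win_rate[i] += wins[(i, j)] / n
        win_rate[j] += 1 - wins[(i, j)] / n
print("estimated ranking:", sorted(systems, key=lambda s: win_rate[s], reverse=True))
```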

Bridging Cross-Lingual Gaps During Leveraging the Multilingual Sequence-to-Sequence Pretraining for Text Generation and Understanding

zanchangtong/csr4mbart 16 Apr 2022

For multilingual sequence-to-sequence pretrained language models (multilingual Seq2Seq PLMs), e.g., mBART, the self-supervised pretraining task covers a wide range of monolingual languages (e.g., 25 languages from CommonCrawl), while the downstream cross-lingual tasks typically operate on a bilingual subset (e.g., English-German). This creates a data discrepancy (domain discrepancy) and a cross-lingual learning objective discrepancy (task discrepancy) between the pretraining and finetuning stages.