NLG Evaluation

23 papers with code • 0 benchmarks • 0 datasets

Evaluating text generated by NLG (Natural Language Generation) systems, such as large language models.

Most implemented papers

NLG Evaluation Metrics Beyond Correlation Analysis: An Empirical Metric Preference Checklist

inimah/metric-preference-checklist 15 May 2023

Our proposed framework provides a means (i) to verify whether automatic metrics are faithful to human preference, regardless of their correlation level with human judgements, and (ii) to inspect the strengths and limitations of NLG systems via pairwise evaluation.
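
The core check behind such pairwise evaluation can be expressed very compactly: count how often the metric prefers the same output a human preferred. Below is a minimal, self-contained sketch of that idea; the scores and preference labels are illustrative placeholders, not data or code from inimah/metric-preference-checklist.

```python
# Hypothetical pairwise data: (metric score for output A, metric score for output B,
# human-preferred output). All values are made up for illustration.
pairs = [
    (0.72, 0.65, "A"),
    (0.40, 0.55, "B"),
    (0.81, 0.79, "B"),
]

def metric_prefers(score_a: float, score_b: float) -> str:
    """Which output the automatic metric ranks higher."""
    return "A" if score_a > score_b else "B"

# Fraction of pairs where the metric agrees with the human preference.
agreement = sum(metric_prefers(a, b) == human for a, b, human in pairs) / len(pairs)
print(f"pairwise agreement with human preference: {agreement:.2f}")
```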

Towards a Unified Multi-Dimensional Evaluator for Text Generation

maszhongming/unieval 13 Oct 2022

We re-frame NLG evaluation as a Boolean Question Answering (QA) task, and by guiding the model with different questions, we can use one evaluator to evaluate along multiple dimensions.
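
As a rough illustration of the Boolean QA idea (not the UniEval API from maszhongming/unieval), one can phrase each evaluation dimension as a yes/no question and read off the model's probability of answering "Yes". The sketch below assumes a generic T5-style model; the checkpoint name and question wording are placeholders.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder checkpoint; any T5-style seq2seq model that answers yes/no questions works similarly.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

def boolean_qa_score(question: str, context: str) -> float:
    """Probability that the model answers "Yes" to a yes/no evaluation question."""
    prompt = f"question: {question} context: {context}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    yes_id = tokenizer("Yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer("No", add_special_tokens=False).input_ids[0]
    decoder_start = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_start).logits[0, -1]
    yes_prob, _ = torch.softmax(logits[[yes_id, no_id]], dim=-1)
    return yes_prob.item()

# One illustrative yes/no question per evaluation dimension.
questions = {
    "coherence": "Is this summary coherent and well structured?",
    "fluency": "Is this summary fluent and grammatical?",
}
summary = "the movie was praised by critics it opened last friday"
print({dim: round(boolean_qa_score(q, summary), 3) for dim, q in questions.items()})
```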

G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

nlpyang/geval 29 Mar 2023

In this work, we present G-Eval, a framework that uses large language models with chain-of-thought (CoT) prompting and a form-filling paradigm to assess the quality of NLG outputs.
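
A hedged sketch of this kind of LLM-as-judge setup is shown below, using the OpenAI Python client. The prompt wording, model name, and score extraction are illustrative stand-ins, not the official prompts or scoring from nlpyang/geval; G-Eval weights scores by token probabilities, and averaging several sampled judgments is used here only as a rough substitute.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative prompt: an evaluation criterion, CoT-style evaluation steps, and one score field to fill.
PROMPT = """You will be given a source document and a summary.

Evaluation criterion: Coherence (1-5) - the summary should be well structured and logically ordered.

Evaluation steps:
1. Read the source document and identify its main points.
2. Check whether the summary presents those points in a clear, logical order.
3. Assign a coherence score from 1 to 5.

Source: {source}
Summary: {summary}

Coherence score (1-5):"""

def llm_judge_score(source: str, summary: str, model: str = "gpt-4o") -> float:
    """Sample several judgments and average them (a rough stand-in for
    G-Eval's probability-weighted scoring)."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(source=source, summary=summary)}],
        temperature=1.0,
        n=5,
    )
    scores = []
    for choice in resp.choices:
        digits = [ch for ch in choice.message.content if ch.isdigit()]
        if digits:
            scores.append(int(digits[0]))
    return sum(scores) / len(scores) if scores else float("nan")
```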

Not All Metrics Are Guilty: Improving NLG Evaluation by Diversifying References

rucaibox/div-ref 24 May 2023

Most research on natural language generation (NLG) relies on evaluation benchmarks with a limited number of references per sample, which may result in poor correlations with human judgements.
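
The remedy proposed here is to expand each sample's reference set (e.g., with LLM-generated paraphrases) before scoring. The sketch below illustrates multi-reference scoring with sacrebleu; the hypothesis and the hand-written "paraphrases" are placeholders rather than Div-Ref outputs.

```python
from sacrebleu.metrics import BLEU

bleu = BLEU(effective_order=True)  # sentence-level BLEU

hypothesis = "The quick brown fox leaps over the lazy dog."
references = [
    "The quick brown fox jumps over the lazy dog.",   # original (single) reference
    "A fast brown fox leaps over a lazy dog.",        # hand-written stand-in for an LLM paraphrase
    "Over the lazy dog, the quick brown fox leaps.",  # another stand-in paraphrase
]

single_ref = bleu.sentence_score(hypothesis, references[:1]).score
multi_ref = bleu.sentence_score(hypothesis, references).score
print(f"single-reference BLEU: {single_ref:.1f}")
print(f"multi-reference  BLEU: {multi_ref:.1f}")
```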

Why We Need New Evaluation Metrics for NLG

jeknov/EMNLP_17_submission EMNLP 2017

The majority of NLG evaluation relies on automatic metrics, such as BLEU.

A Study of Automatic Metrics for the Evaluation of Natural Language Explanations

MirunaClinciu/ExBAN EACL 2021

As transparency becomes key for robotics and AI, it will be necessary to evaluate the methods through which transparency is provided, including automatically generated natural language (NL) explanations.

Perturbation CheckLists for Evaluating NLG Evaluation Metrics

iitmnlp/evaleval EMNLP 2021

Natural Language Generation (NLG) evaluation is a multifaceted task requiring assessment of multiple desirable criteria, e.g., fluency, coherence, coverage, relevance, adequacy, and overall quality.
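
The idea behind such checklists can be illustrated with a single perturbation: apply a targeted corruption (here, sentence shuffling, which should hurt coherence) and verify that a metric's score drops. The "metric" below is a deliberately naive placeholder, not one of the metrics or templates from iitmnlp/evaleval.

```python
import random

def shuffle_sentences(text: str) -> str:
    """Targeted perturbation: scramble sentence order, which should hurt coherence."""
    sents = [s.strip() for s in text.split(".") if s.strip()]
    random.shuffle(sents)
    return ". ".join(sents) + "."

def passes_coherence_check(metric, text: str) -> bool:
    """A coherence-sensitive metric should score the original above its shuffled version."""
    return metric(text) > metric(shuffle_sentences(text))

# A deliberately order-blind "metric" (word count) fails the check,
# which is exactly the kind of blind spot a perturbation checklist is meant to expose.
naive_metric = lambda t: len(t.split())
print(passes_coherence_check(naive_metric, "First point. Then a second point. Finally a third point."))
```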

Compression, Transduction, and Creation: A Unified Framework for Evaluating Natural Language Generation

tanyuqian/ctc-gen-eval EMNLP 2021

Based on the nature of information change from input to output, we classify NLG tasks into compression (e.g., summarization), transduction (e.g., text rewriting), and creation (e.g., dialog).

Active Evaluation: Efficient NLG Evaluation with Few Pairwise Comparisons

akashkm99/duelnlg ACL 2022

In this work, we introduce Active Evaluation, a framework to efficiently identify the top-ranked system by actively choosing system pairs for comparison using dueling bandit algorithms.
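
A toy version of the idea, with a simple uncertainty heuristic standing in for the paper's dueling bandit algorithms and a random simulator standing in for human annotators (not the akashkm99/duelnlg implementation), might look like the following; system names and quality values are made up for illustration.

```python
import random
from itertools import combinations

systems = ["sys_A", "sys_B", "sys_C"]
true_quality = {"sys_A": 0.7, "sys_B": 0.5, "sys_C": 0.3}  # hidden; used only to simulate annotators

wins = {pair: 0 for pair in combinations(systems, 2)}
plays = {pair: 0 for pair in wins}

def judge(i: str, j: str) -> bool:
    """Simulated annotator: i beats j with probability tied to the quality gap."""
    return random.random() < 0.5 + 0.5 * (true_quality[i] - true_quality[j])

def selection_score(pair):
    """Prefer unexplored pairs, then pairs whose outcome is still most uncertain."""
    n = plays[pair]
    if n == 0:
        return (0, 0.0)
    return (1, abs(wins[pair] / n - 0.5) - 1.0 / (n + 1))

for _ in range(200):
    pair = min(plays, key=selection_score)  # actively choose the next comparison
    i, j = pair
    wins[pair] += judge(i, j)
    plays[pair] += 1

# Copeland-style aggregation: average win rate of each system across its pairings.
win_rate = {s: 0.0 for s in systems}
for (i, j), n in plays.items():
    if n:
        win_rate[i] += wins[(i, j)] / n
        win_rate[j] += 1 - wins[(i, j)] / n
print("estimated ranking:", sorted(systems, key=lambda s: win_rate[s], reverse=True))
```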

Bridging Cross-Lingual Gaps During Leveraging the Multilingual Sequence-to-Sequence Pretraining for Text Generation and Understanding

zanchangtong/csr4mbart 16 Apr 2022

For multilingual sequence-to-sequence pretrained language models (multilingual Seq2Seq PLMs), e.g., mBART, the self-supervised pretraining task covers a wide range of monolingual languages (e.g., 25 languages from CommonCrawl), while the downstream cross-lingual tasks typically operate on a bilingual subset (e.g., English-German). This creates a data discrepancy (domain discrepancy) and a cross-lingual learning objective discrepancy (task discrepancy) between the pretraining and finetuning stages.