This is not a Dataset: A Large Negation Benchmark to Challenge Large Language Models

24 Oct 2023 · Iker García-Ferrero, Begoña Altuna, Javier Álvez, Itziar Gonzalez-Dios, German Rigau

Although large language models (LLMs) have apparently acquired a certain level of grammatical knowledge and the ability to make generalizations, they fail to interpret negation, a crucial step in Natural Language Processing. We try to clarify the reasons for the sub-optimal performance of LLMs in understanding negation. We introduce a large semi-automatically generated dataset of circa 400,000 descriptive sentences about commonsense knowledge, each of which can be true or false, with negation present in different forms in about 2/3 of the corpus. We have used our dataset with the largest available open LLMs in a zero-shot approach to gauge their generalization and inference capability, and we have also fine-tuned some of the models to assess whether the understanding of negation can be trained. Our findings show that, while LLMs are proficient at classifying affirmative sentences, they struggle with negative sentences and lack a deep understanding of negation, often relying on superficial cues. Although fine-tuning the models on negative sentences improves their performance, the lack of generalization in handling negation is persistent, highlighting the ongoing challenges of LLMs regarding negation understanding and generalization. The dataset and code are publicly available.
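The contrast the abstract draws between affirmative and negative sentences can be made concrete by scoring the two groups separately. The following is a minimal sketch, not the authors' evaluation code: the field names `has_negation` and `label`, the helper `accuracy_by_polarity`, and the four toy sentences are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def accuracy_by_polarity(examples, predictions):
    """Return accuracy separately for affirmative and negative sentences.

    `examples` is a list of dicts with hypothetical keys `has_negation`
    (bool) and `label` (bool, True if the sentence is true).
    `predictions` is a list of bools in the same order.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for example, prediction in zip(examples, predictions):
        group = "negative" if example["has_negation"] else "affirmative"
        total[group] += 1
        correct[group] += int(prediction == example["label"])
    return {group: correct[group] / total[group] for group in total}


# Toy data, invented for this sketch (not taken from the dataset).
examples = [
    {"sentence": "A dog is an animal.", "has_negation": False, "label": True},
    {"sentence": "A dog is a plant.", "has_negation": False, "label": False},
    {"sentence": "A dog is not a plant.", "has_negation": True, "label": True},
    {"sentence": "A dog is not an animal.", "has_negation": True, "label": False},
]
# A hypothetical model that answers False whenever it sees "not".
predictions = [True, False, False, False]

scores = accuracy_by_polarity(examples, predictions)
print(scores)
```

In this toy case the model is perfect on affirmative sentences but only right half the time on negated ones, the kind of gap that a single overall accuracy figure would hide.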


Datasets


Introduced in the Paper:

This is not a Dataset
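Task                           Dataset                Model               Metric Name  Metric Value  Global Rank
Zero-Shot Text Classification  This is not a Dataset  Flan-T5-xxl         Accuracy     66.1          # 1
Zero-Shot Text Classification  This is not a Dataset  Flan-T5-xxl         Coherence    0.9           # 1
Zero-Shot Text Classification  This is not a Dataset  Falcon40B-instruct  Accuracy     54.7          # 4
Zero-Shot Text Classification  This is not a Dataset  Falcon40B-instruct  Coherence    0.1           # 3
Zero-Shot Text Classification  This is not a Dataset  WizardLM 30B        Accuracy     57.3          # 3
Zero-Shot Text Classification  This is not a Dataset  WizardLM 30B        Coherence    0.0           # 4
Zero-Shot Text Classification  This is not a Dataset  Vicuna 13B v1.1     Accuracy     57.8          # 2
Zero-Shot Text Classification  This is not a Dataset  Vicuna 13B v1.1     Coherence    0.2           # 2
Zero-Shot Text Classification  This is not a Dataset  LLaMA 65B           Accuracy     50.3          # 5
Zero-Shot Text Classification  This is not a Dataset  LLaMA 65B           Coherence    0.0           # 4
Text Classification            This is not a Dataset  Flan-T5-xxl         Accuracy     94.1          # 2
Text Classification            This is not a Dataset  Flan-T5-xxl         Coherence    51.8          # 2
Text Classification            This is not a Dataset  Vicuna 13B v1.1     Accuracy     95.7          # 1
Text Classification            This is not a Dataset  Vicuna 13B v1.1     Coherence    81.2          # 1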
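
This page does not define the Coherence metric. The sketch below assumes one plausible reading, namely that related sentences are grouped and a group counts as coherent only if every sentence in it is classified correctly. Under that assumption it shows how accuracy and coherence can diverge. The grouping key, the function name `accuracy_and_coherence`, and the toy records are hypothetical and are not taken from the paper or its code.

```python
def accuracy_and_coherence(records):
    """Compute accuracy and an assumed group-level coherence score, in percent.

    `records` is a list of (group_id, gold_label, predicted_label) tuples.
    Coherence here is the share of groups whose sentences are ALL
    classified correctly (an assumption made for this sketch).
    """
    results_by_group = {}
    n_correct = 0
    for group_id, gold, predicted in records:
        is_correct = gold == predicted
        n_correct += int(is_correct)
        results_by_group.setdefault(group_id, []).append(is_correct)

    accuracy = 100 * n_correct / len(records)
    n_coherent = sum(all(results) for results in results_by_group.values())
    coherence = 100 * n_coherent / len(results_by_group)
    return accuracy, coherence


# Toy records, invented for illustration: two groups of two sentences each.
records = [
    ("g1", True, True),
    ("g1", False, False),
    ("g2", True, True),
    ("g2", False, True),  # one error makes the whole group incoherent
]

accuracy, coherence = accuracy_and_coherence(records)
print(f"accuracy={accuracy:.1f} coherence={coherence:.1f}")
```

With this reading, a single wrong answer in a group drops the whole group, so coherence is never higher than accuracy. That is consistent with the table, where coherence sits below accuracy in every row, although the exact definition should be checked against the paper.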
