Textual Entailment Recognition has been proposed as a generic task that captures major semantic inference needs across many NLP applications, such as Question Answering, Information Retrieval, Information Extraction, and Text Summarization. The task requires recognizing, given two text fragments, whether the meaning of one text is entailed by (can be inferred from) the other.
7 PAPERS • 1 BENCHMARK
Natural Language Inference (NLI), also called Textual Entailment, is an important task in NLP with the goal of determining the inference relationship between a premise p and a hypothesis h. It is a three-class problem, where each pair (p, h) is assigned to one of these classes: "ENTAILMENT" if the hypothesis can be inferred from the premise, "CONTRADICTION" if the hypothesis contradicts the premise, and "NEUTRAL" if neither of the above holds. There are large NLI datasets for English, such as SNLI, MNLI, and SciTail, but few for low-resource languages like Persian. Persian (Farsi) is a pluricentric language spoken by around 110 million people in countries including Iran, Afghanistan, and Tajikistan. FarsTail is the first relatively large-scale Persian dataset for the NLI task. A total of 10,367 samples were generated from a collection of 3,539 multiple-choice questions. The train, validation, and test portions include 7,266, 1,537, and 1,564 instances, respectively.
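The three-class setup described above can be sketched in a few lines. This is a minimal illustration of the task format only: the toy pairs and the stub classifier are invented for the example, not drawn from FarsTail or any trained model.

```python
# Minimal sketch of the three-class NLI task format (premise, hypothesis, label).
# The pairs below are toy illustrations; the classifier is a stub, not a model.

LABELS = ("ENTAILMENT", "CONTRADICTION", "NEUTRAL")

def evaluate(pairs, classify):
    """Accuracy of a classifier over (premise, hypothesis, gold_label) triples."""
    correct = sum(classify(p, h) == gold for p, h, gold in pairs)
    return correct / len(pairs)

# One toy example per class:
pairs = [
    ("A man is playing a guitar.", "A person is making music.", "ENTAILMENT"),
    ("A man is playing a guitar.", "Nobody is playing an instrument.", "CONTRADICTION"),
    ("A man is playing a guitar.", "The man is a professional musician.", "NEUTRAL"),
]

# A trivial majority-class stub gets exactly one of the three pairs right:
print(evaluate(pairs, lambda p, h: "ENTAILMENT"))
```

Benchmark evaluation on FarsTail and similar datasets follows this shape: a model replaces the stub, and accuracy over the labeled pairs is reported.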
6 PAPERS • 1 BENCHMARK
JGLUE (Japanese General Language Understanding Evaluation) is a benchmark built to measure general natural language understanding ability in Japanese.
6 PAPERS • NO BENCHMARKS YET
The Russian Commitment Bank is a corpus of naturally occurring discourses whose final sentence contains a clause-embedding predicate under an entailment cancelling operator (question, modal, negation, antecedent of conditional).
IMPPRES (IMPlicature and PRESupposition diagnostic dataset) consists of more than 25k semi-automatically generated sentence pairs illustrating well-studied pragmatic inference types.
5 PAPERS • NO BENCHMARKS YET
LiDiRus is a diagnostic dataset that covers a large volume of linguistic phenomena while allowing systems to be evaluated on a simple textual entailment recognition test.
5 PAPERS • 1 BENCHMARK
RuSentRel is a corpus of analytical articles in the domain of international politics, obtained from authoritative foreign sources and translated into Russian. The collected articles contain both the author's opinion on the subject matter and a large number of relations between the participants of the described situations. In total, 73 large analytical texts were labeled with about 2,000 relations.
BioNLI is a dataset for natural language inference in the biomedical domain. It contains abstracts from the biomedical literature and mechanistic premises generated with nine different strategies.
3 PAPERS • 1 BENCHMARK
IndoNLI is the first human-elicited NLI dataset for Indonesian consisting of nearly 18K sentence pairs annotated by crowd workers and experts.
3 PAPERS • NO BENCHMARKS YET
The dataset contains 3,304 cases from the Supreme Court of the United States from 1955 to 2021. Each case includes its identifiers as well as the facts of the case and the decision outcome. Related datasets rarely include the facts of the case, which can be helpful in natural language processing applications. One potential use case is predicting the outcome of a case from its facts.
Pars-ABSA is a manually annotated Persian dataset for aspect-based sentiment analysis, verified by three native Persian speakers. It consists of 5,114 positive, 3,061 negative, and 1,827 neutral samples from 5,602 unique reviews.
Hugging Face Datasets is a great library, but it lacks standardization, and datasets require preprocessing work before they can be used interchangeably. tasksource automates this preprocessing and facilitates scaling reproducible multi-task learning.
WiLI-2018 is a benchmark dataset for monolingual written natural language identification. It is a publicly available, free-of-charge dataset of short text extracts from Wikipedia, containing 1,000 paragraphs for each of 235 languages, totaling 235,000 paragraphs. WiLI is a classification dataset: given an unknown paragraph written in one dominant language, the task is to decide which language it is.
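The classification task WiLI poses can be sketched with a simple character n-gram baseline. This is an illustrative toy, not the dataset's reference method: the two-language training snippets are invented, and WiLI itself spans 235 languages.

```python
# Toy sketch of WiLI-style language identification: given a paragraph,
# predict its dominant language. Uses a character-trigram overlap baseline
# with invented two-language training data (illustration only).

from collections import Counter

def profile(text, n=3):
    """Character n-gram frequency profile of a text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def identify(paragraph, training_texts):
    """Pick the language whose training profile overlaps the paragraph most."""
    para = profile(paragraph)
    def overlap(lang):
        return sum((para & profile(training_texts[lang])).values())
    return max(training_texts, key=overlap)

training = {
    "eng": "the quick brown fox jumps over the lazy dog and the cat",
    "deu": "der schnelle braune fuchs springt ueber den faulen hund und die katze",
}
print(identify("the dog and the fox", training))  # → eng
```

Real systems trained on WiLI replace the trigram-overlap heuristic with a learned classifier, but the input/output contract (paragraph in, language label out) is the same.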
XWINO is a multilingual collection of Winograd Schemas in six languages that can be used for evaluation of cross-lingual commonsense reasoning capabilities.
esXNLI is a bilingual NLI dataset. It comprises 2,490 examples from 5 different genres that were originally annotated in Spanish, and translated into English by professional translators. It serves as a counterpoint to XNLI, which was originally annotated in English and translated into 14 other languages, including Spanish. The dataset was conceived to be used in conjunction with the XNLI development set to analyse the effect of translation in cross-lingual transfer learning.
Natural Language Inference processes pairs of sentences to extract their semantic relations. Sentence pairs are annotated with one of three classes (Entailment, Contradiction, Neutral).
2 PAPERS • NO BENCHMARKS YET
The Japanese Adversarial NLI (JaNLI) dataset is designed to require understanding of Japanese linguistic phenomena and illuminate the vulnerabilities of models. Please see the paper Assessing the Generalization Capacity of Pre-trained Language Models through Japanese Adversarial Natural Language Inference for details.
We generate epistemic reasoning problems using modal logic to target theory of mind (ToM) in natural language processing models.
2 PAPERS • 1 BENCHMARK
Natural Language Inference in Turkish (NLI-TR) provides translations of two large English NLI datasets into Turkish; a team of experts validated their translation quality and fidelity to the original labels.
NewsPH-NLI is a sentence entailment benchmark dataset in the low-resource Filipino language.
The CANDOR corpus is a large, novel, multimodal corpus of 1,656 recorded conversations in spoken English. This 7+ million word, 850-hour corpus totals over 1 TB of audio, video, and transcripts, with moment-to-moment measures of vocal, facial, and semantic expression, along with an extensive survey of post-conversation reflections from speakers.
1 PAPER • NO BENCHMARKS YET
DistNLI is a synthesized benchmark for probing neural network models on how conjunctions interact with distributivity in the NLI task in American English. It consists of minimal premise-hypothesis pairs that differ in conjunction structure and in distributivity-related linguistic phenomena. The dataset currently comprises 328 sentences (164 for distributive and 164 for ambiguous predicates), annotated by four proficient English speakers with backgrounds in NLP and linguistics. Due to the specificity of the linguistic phenomenon involved and the dataset's size, DistNLI should be used only as an adversarial dataset for investigating the distributivity of verb predication.
This is a set of debiased Natural Language Inference (NLI) datasets produced by the paper Generating Data to Mitigate Spurious Correlations in Natural Language Inference Datasets. The datasets are constructed by augmenting SNLI or MNLI with data samples that are generated to mitigate the spurious correlations in the original datasets. Please visit this repository for more details.
GQNLI-FR is a manually translated French version of the GQNLI challenge dataset, originally written in English.
The Gigaword Entailment dataset is a dataset for entailment prediction between an article and its headline. It is built from the Gigaword dataset.
HANS (Heuristic Analysis for NLI Systems) is a dataset containing many examples on which common syntactic heuristics adopted by NLI models fail.
1 PAPER • 1 BENCHMARK
The NLI4Wills corpus can be used to train transformer and sentence-transformer models for evaluating the validity of legal will statements. The dataset consists of ID numbers, three types of inputs (legal will statements, laws, and conditions), and classifications (support, refute, or unrelated).
This dataset tests the capabilities of language models to correctly capture the meaning of words denoting probabilities (WEP, also called verbal probabilities), e.g. words like "probably", "maybe", "surely", "impossible".
PropSegmEnt is a corpus of over 35K propositions annotated by expert human raters. The dataset structure resembles the tasks of (1) segmenting sentences within a document to the set of propositions, and (2) classifying the entailment relation of each proposition with respect to a different yet topically-aligned document, i.e. documents describing the same event or entity.
RTE3-FR dataset is the French translation of the Textual Entailment English dataset used in the RTE-3 Challenge (https://nlp.stanford.edu/RTE3-pilot).