IndoNLI is the first human-elicited NLI dataset for Indonesian consisting of nearly 18K sentence pairs annotated by crowd workers and experts.
3 PAPERS • NO BENCHMARKS YET
The dataset contains 3304 cases from the Supreme Court of the United States from 1955 to 2021. Each case has the case's identifiers as well as the facts of the case and the decision outcome. Other related datasets rarely included the facts of the case which could prove to be helpful in natural language processing applications. One potential use case of this dataset is determining the outcome of a case using its facts.
Huggingface Datasets is a great library, but it lacks standardization, and datasets require preprocessing work to be used interchangeably. tasksource automates this and facilitates reproducible multi-task learning scaling.
WiLI-2018 is a benchmark dataset for monolingual written natural language identification. WiLI-2018 is a publicly available, free of charge dataset of short text extracts from Wikipedia. It contains 1000 paragraphs of 235 languages, totaling in 23500 paragraphs. WiLI is a classification dataset: Given an unknown paragraph written in one dominant language, it has to be decided which language it is.
XWINO is a multilingual collection of Winograd Schemas in six languages that can be used for evaluation of cross-lingual commonsense reasoning capabilities.
3 PAPERS • 1 BENCHMARK
esXNLI is a bilingual NLI dataset. It comprises 2,490 examples from 5 different genres that were originally annotated in Spanish, and translated into English by professional translators. It serves as a counterpoint to XNLI, which was originally annotated in English and translated into 14 other languages, including Spanish. The dataset was conceived to be used in conjunction with the XNLI development set to analyse the effect of translation in cross-lingual transfer learning.
BioNLI is a dataset in biomedical natural language inference. This dataset contains abstracts from biomedical literature and mechanistic premises generated with nine different strategies.
2 PAPERS • 1 BENCHMARK
We generate epistemic reasoning problems using modal logic to target theory of mind (tom) in natural language processing models.
NewsPH-NLI is a sentence entailment benchmark dataset in the low-resource Filipino language.
2 PAPERS • NO BENCHMARKS YET
The CANDOR corpus is a large, novel, multimodal corpus of 1,656 recorded conversations in spoken English. This 7+ million word, 850 hour corpus totals over 1TB of audio, video, and transcripts, with moment-to-moment measures of vocal, facial, and semantic expression, along with an extensive survey of speaker post conversation reflections.
1 PAPER • NO BENCHMARKS YET
This dataset is named as the DistNLI dataset, which is a synthesized benchmark aiming to probe neural network models from the aspect of conjunctions on distributivity in NLI task in American English. DistNLI consists of sentence minimal pairs (premise and hypothesis) differentiated with conjunction structure within the pair and distributivity-related linguistic phenomenon. DistNLI is compiled with 328 sentences so far (164 for distributive and 164 for ambiguous predicates), annotated by 4 proficient English speakers with a background in NLP and Linguistics. Due to the specificity of the linguistic phenomenon involved and its size, this DistNLI dataset should only be used as an adversarial dataset in the investigation of distributivity of verb predication.
This is a set of debiased Natural Language Inference (NLI) datasets produced by the paper Generating Data to Mitigate Spurious Correlations in Natural Language Inference Datasets. The datasets are constructed by augmenting SNLI or MNLI with data samples that are generated to mitigate the spurious correlations in the original datasets. Please visit this repository for more details.
The Gigaword Entailment dataset is a dataset for entailment prediction between an article and its headline. It is built from the Gigaword dataset.
NLI4Wills Corpus can be used to train transformers and sentence-transformer models for the validity evaluation of the legal will statements. Our dataset consists of ID numbers, three types of inputs (legal will statements, laws, and conditions) and classifications (support, refute, or unrelated).
This dataset tests the capabilities of language models to correctly capture the meaning of words denoting probabilities (WEP), e.g. words like "probably", "maybe", "surely", "impossible".
1 PAPER • 1 BENCHMARK
PropSegmEnt is a corpus of over 35K propositions annotated by expert human raters. The dataset structure resembles the tasks of (1) segmenting sentences within a document to the set of propositions, and (2) classifying the entailment relation of each proposition with respect to a different yet topically-aligned document, i.e. documents describing the same event or entity.
This dataset tests the capabilities of language models to correctly capture the meaning of words denoting probabilities (WEP, also called verbal probabilities), e.g. words like "probably", "maybe", "surely", "impossible".