🔔 Share your dataset with the ML community!

Filter by Modality (clear)

Filter by Task

Filter by Language

2548 dataset results for Texts

IGLUE (Image-Grounded Language Understanding Evaluation)

The Image-Grounded Language Understanding Evaluation (IGLUE) benchmark brings together—by both aggregating pre-existing datasets and creating new ones—visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages. The benchmark enables the evaluation of multilingual multimodal models for transfer learning, not only in a zero-shot setting, but also in newly defined few-shot learning setups.

21 PAPERS • 13 BENCHMARKS

KdConv (Knowledge-driven Conversation)

KdConv is a Chinese multi-domain Knowledge-driven Conversation dataset, grounding the topics in multi-turn conversations to knowledge graphs. KdConv contains 4.5K conversations from three domains (film, music, and travel), and 86K utterances with an average turn number of 19.0. These conversations contain in-depth discussions on related topics and natural transition between multiple topics, while the corpus can also used for exploration of transfer learning and domain adaptation.

21 PAPERS • NO BENCHMARKS YET

LitBank

LitBank is an annotated dataset of 100 works of English-language fiction to support tasks in natural language processing and the computational humanities, described in more detail in the following publications:

21 PAPERS • 1 BENCHMARK

PopQA

PopQA is an open-domain QA dataset with 14k QA pairs with fine-grained Wikidata entity ID, Wikipedia page views, and relationship type information.

21 PAPERS • NO BENCHMARKS YET

STREUSLE

STREUSLE stands for Supersense-Tagged Repository of English with a Unified Semantics for Lexical Expressions. The text is from the web reviews portion of the English Web Treebank [9]. STREUSLE incorporates comprehensive annotations of multiword expressions (MWEs) [1] and semantic supersenses for lexical expressions. The supersense labels apply to single- and multiword noun and verb expressions, as described in [2], and prepositional/possessive expressions, as described in [3, 4, 5, 6, 7, 8]. Lexical expressions also feature a lexical category label indicating its holistic grammatical status; for verbal multiword expressions, these labels incorporate categories from the PARSEME 1.1 guidelines [15]. For each token, these pieces of information are concatenated together into a lextag: a sentence's words and their lextags are sufficient to recover lexical categories, supersenses, and multiword expressions [8].

21 PAPERS • 1 BENCHMARK

Text8

Desc: About of Text8

21 PAPERS • 1 BENCHMARK

TextOCR

TextOCR is a dataset to benchmark text recognition on arbitrary shaped scene-text. TextOCR requires models to perform text-recognition on arbitrary shaped scene-text present on natural images. TextOCR provides ~1M high quality word annotations on TextVQA images allowing application of end-to-end reasoning on downstream tasks such as visual question answering or image captioning.

21 PAPERS • NO BENCHMARKS YET

WikiSplit

Contains one million naturally occurring sentence rewrites, providing sixty times more distinct split examples and a ninety times larger vocabulary than the WebSplit corpus introduced by Narayan et al. (2017) as a benchmark for this task.

21 PAPERS • 1 BENCHMARK

AGQA (Action Genome Question Answering)

Action Genome Question Answering (AGQA) is a benchmark for compositional spatio-temporal reasoning. AGQA contains 192M unbalanced question answer pairs for 9.6K videos. It also contains a balanced subset of 3.9M question answer pairs, 3 orders of magnitude larger than existing benchmarks, that minimizes bias by balancing the answer distributions and types of question structures.

20 PAPERS • NO BENCHMARKS YET

FairytaleQA

FairytaleQA is a dataset focusing on narrative comprehension of kindergarten to eighth-grade students. Annotated by educational experts based on an evidence-based theoretical framework, FairytaleQA consists of 10,580 explicit and implicit questions derived from 278 children-friendly story narratives, covering seven types of narrative elements or relations. It can support narrative Question Generation (QG) and Narrative Question Answering (QA) tasks.

20 PAPERS • 2 BENCHMARKS

HONEST (Hurtful Sentence Completion in English Language Models)

The HONEST dataset is a template-based corpus for testing the hurtfulness of sentence completions in language models (e.g., BERT) in six different languages (English, Italian, French, Portuguese, Romanian, and Spanish). HONEST is composed of 420 instances for each language, which are generated from 28 identity terms (14 male and 14 female) and 15 templates. It uses a set of identifier terms in singular and plural (i.e., woman, women, girl, boys) and a series of predicates (i.e., “works as [MASK]”, “is known for [MASK]”). The objective is to use language models to fill the sentence, then the hurtfulness of the completion is evaluated.

20 PAPERS • 1 BENCHMARK

MIR-1K

MIR-1K (Multimedia Information Retrieval lab, 1000 song clips) is a dataset designed for singing voice separation. It contains:

20 PAPERS • NO BENCHMARKS YET

MLQE-PE

MLQE-PE (Multilingual Quality Estimation and Automatic Post-editing Dataset)

The Multilingual Quality Estimation and Automatic Post-editing (MLQE-PE) Dataset is a dataset for Machine Translation (MT) Quality Estimation (QE) and Automatic Post-Editing (APE). The dataset contains seven language pairs, with human labels for 9,000 translations per language pair in the following formats: sentence-level direct assessments and post-editing effort, and word-level good/bad labels. It also contains the post-edited sentences, as well as titles of the articles where the sentences were extracted from, and the neural MT models used to translate the text.

20 PAPERS • NO BENCHMARKS YET

MultiDoc2Dial

MultiDoc2Dial (MultiDoc2Dial: Modeling Dialogues Grounded in Multiple Documents)

MultiDoc2Dial is a new task and dataset on modeling goal-oriented dialogues grounded in multiple documents. Most previous works treat document-grounded dialogue modeling as a machine reading comprehension task based on a single given document or passage. We aim to address more realistic scenarios where a goal-oriented information-seeking conversation involves multiple topics, and hence is grounded on different documents.

20 PAPERS • NO BENCHMARKS YET

Paralex

Paralex learns from a collection of 18 million question-paraphrase pairs scraped from WikiAnswers.

20 PAPERS • 1 BENCHMARK

PrOntoQA (Proof and Ontology-Generated Question-Answering)

PrOntoQA is a question-answering dataset which generates examples with chains-of-thought that describe the reasoning required to answer the questions correctly. The sentences in the examples are syntactically simple and amenable to semantic parsing. It can be used to formally analyze the predicted chain-of-thought from large language models such as GPT-3.

20 PAPERS • NO BENCHMARKS YET

RECCON

RECCON is a dataset for the task of recognizing emotion cause in conversations.

20 PAPERS • 2 BENCHMARKS

Snopes

Fact-checking (FC) articles which contains pairs (multimodal tweet and a FC-article) from snopes.com.

20 PAPERS • 1 BENCHMARK

WIQA (What-If Question Answering)

The WIQA dataset V1 has 39705 questions containing a perturbation and a possible effect in the context of a paragraph. The dataset is split into 29808 train questions, 6894 dev questions and 3003 test questions.

20 PAPERS • NO BENCHMARKS YET

XGLUE

XGLUE is an evaluation benchmark XGLUE,which is composed of 11 tasks that span 19 languages. For each task, the training data is only available in English. This means that to succeed at XGLUE, a model must have a strong zero-shot cross-lingual transfer capability to learn from the English data of a specific task and transfer what it learned to other languages. Comparing to its concurrent work XTREME, XGLUE has two characteristics: First, it includes cross-lingual NLU and cross-lingual NLG tasks at the same time; Second, besides including 5 existing cross-lingual tasks (i.e. NER, POS, MLQA, PAWS-X and XNLI), XGLUE selects 6 new tasks from Bing scenarios as well, including News Classification (NC), Query-Ad Matching (QADSM), Web Page Ranking (WPR), QA Matching (QAM), Question Generation (QG) and News Title Generation (NTG). Such diversities of languages, tasks and task origin provide a comprehensive benchmark for quantifying the quality of a pre-trained model on cross-lingual natural lan

20 PAPERS • 2 BENCHMARKS

gRefCOCO

gRefCOCO is the first large-scale Generalized Referring Expression Segmentation dataset that contains multi-target, no-target, and single-target expressions.

20 PAPERS • 2 BENCHMARKS

iSarcasmEval

iSarcasmEval is the first shared task to target intended sarcasm detection: the data for this task was provided and labelled by the authors of the texts themselves. Such an approach minimises the downfalls of other methods to collect sarcasm data, which rely on distant supervision or third-party annotations. The shared task contains two languages, English and Arabic, and three subtasks: sarcasm detection, sarcasm category classification, and pairwise sarcasm identification given a sarcastic sentence and its non-sarcastic rephrase. The task received submissions from 60 different teams, with the sarcasm detection task being the most popular. Most of the participating teams utilised pre-trained language models. In this paper, we provide an overview of the task, data, and participating teams.

20 PAPERS • NO BENCHMARKS YET

ChFinAnn

Ten years (2008-2018) ChFinAnn documents and human-summarized event knowledge bases to conduct the DS-based event labeling. Five event types included: Equity Freeze (EF), Equity Repurchase (ER), Equity Underweight (EU), Equity Overweight (EO) and Equity Pledge (EP), which belong to major events required to be disclosed by the regulator and may have a huge impact on the company value. To ensure the labeling quality, the authors set constraints for matched document-record pairs.

19 PAPERS • 1 BENCHMARK

CrisisMMD

CrisisMMD is a large multi-modal dataset collected from Twitter during different natural disasters. It consists of several thousands of manually annotated tweets and images collected during seven major natural disasters including earthquakes, hurricanes, wildfires, and floods that happened in the year 2017 across different parts of the World. The provided datasets include three types of annotations.

19 PAPERS • 2 BENCHMARKS

DIRHA (Distant-speech Interaction for Robust Home Applications)

DIRHA-English is a multi-microphone database composed of real and simulated sequences of 1-minute. The overall corpus is composed of different types of sequences including: 1) Phonetically-rich sentences; 2) WSJ 5-k utterances; 3) WSJ 20-k utterances; 4) Conversational speech (also including keywords and commands). The sequences are available for both UK and US English at 48 kHz. The DIRHA-English dataset offers the possibility to work with a very large number of microphone channels, to use of microphone arrays having different characteristics and to work considering different speech recognition tasks (e.g., phone-loop, keyword spotting, ASR with small and very large language models).

19 PAPERS • 1 BENCHMARK

Event2Mind

Event2Mind is a corpus of 25,000 event phrases covering a diverse range of everyday events and situations.

19 PAPERS • 2 BENCHMARKS

FQuAD

FQuAD (French Question Answering Dataset)

A French Native Reading Comprehension dataset of questions and answers on a set of Wikipedia articles that consists of 25,000+ samples for the 1.0 version and 60,000+ samples for the 1.1 version.

19 PAPERS • 1 BENCHMARK

George Washington

The George Washington dataset contains 20 pages of letters written by George Washington and his associates in 1755 and thereby categorized into historical collection. The images are annotated at word level and contain approximately 5,000 words.

19 PAPERS • NO BENCHMARKS YET

HeadQA

HeadQA is a multi-choice question answering testbed to encourage research on complex reasoning. The questions come from exams to access a specialized position in the Spanish healthcare system, and are challenging even for highly specialized humans.

19 PAPERS • NO BENCHMARKS YET

IIRC

IIRC (Incomplete Information Reading Comprehension)

Contains more than 13K questions over paragraphs from English Wikipedia that provide only partial information to answer them, with the missing information occurring in one or more linked documents. The questions were written by crowd workers who did not have access to any of the linked documents, leading to questions that have little lexical overlap with the contexts where the answers appear.

19 PAPERS • NO BENCHMARKS YET

KLUE (Korean Language Understanding Evaluation)

Korean Language Understanding Evaluation (KLUE) benchmark is a series of datasets to evaluate natural language understanding capability of Korean language models. KLUE consists of 8 diverse and representative tasks, which are accessible to anyone without any restrictions. With ethical considerations in mind, we deliberately design annotation guidelines to obtain unambiguous annotations for all datasets. Furthermore, we build an evaluation system and carefully choose evaluations metrics for every task, thus establishing fair comparison across Korean language models.

19 PAPERS • 1 BENCHMARK

Moral Stories

Moral Stories is a crowd-sourced dataset of structured narratives that describe normative and norm-divergent actions taken by individuals to accomplish certain intentions in concrete situations, and their respective consequences.

19 PAPERS • NO BENCHMARKS YET

MultiFC

Publicly available dataset of naturally occurring factual claims for the purpose of automatic claim verification. It is collected from 26 fact checking websites in English, paired with textual sources and rich metadata, and labelled for veracity by human expert journalists.

19 PAPERS • NO BENCHMARKS YET

NetHack Learning Environment

The NetHack Learning Environment (NLE) is a Reinforcement Learning environment based on NetHack 3.6.6. It is designed to provide a standard reinforcement learning interface to the game, and comes with tasks that function as a first step to evaluate agents on this new environment. NetHack is one of the oldest and arguably most impactful videogames in history, as well as being one of the hardest roguelikes currently being played by humans. It is procedurally generated, rich in entities and dynamics, and overall an extremely challenging environment for current state-of-the-art RL agents, while being much cheaper to run compared to other challenging testbeds. Through NLE, the authors wish to establish NetHack as one of the next challenges for research in decision making and machine learning.

19 PAPERS • 1 BENCHMARK

QuaRel

QuaRel is a crowdsourced dataset of 2771 multiple-choice story questions, including their logical forms.

19 PAPERS • NO BENCHMARKS YET

RCTW-17 (Reading Chinese Text in the Wild)

Features a large-scale dataset with 12,263 annotated images. Two tasks, namely text localization and end-to-end recognition, are set up. The competition took place from January 20 to May 31, 2017. 23 valid submissions were received from 19 teams.

19 PAPERS • NO BENCHMARKS YET

RoboCup

RoboCup is an initiative in which research groups compete by enabling their robots to play football matches. Playing football requires solving several challenging tasks, such as vision, motion, and team coordination. Framing the research efforts onto football attracts public interest (and potential research funding) in robotics, which may otherwise be less entertaining to non-experts.

19 PAPERS • NO BENCHMARKS YET

SUBJ

SUBJ (Subjectivity dataset)

Available are collections of movie-review documents labeled with respect to their overall sentiment polarity (positive or negative) or subjective rating (e.g., "two and a half stars") and sentences labeled with respect to their subjectivity status (subjective or objective) or polarity.

19 PAPERS • 1 BENCHMARK

Senseval-2

There are now many computer programs for automatically determining the sense of a word in context (Word Sense Disambiguation or WSD). The purpose of SENSEVAL is to evaluate the strengths and weaknesses of such programs with respect to different words, different varieties of language, and different languages.

19 PAPERS • NO BENCHMARKS YET

WavCaps

A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research.

19 PAPERS • NO BENCHMARKS YET

2010 i2b2/VA

2010 i2b2/VA is a biomedical dataset for relation classification and entity typing.

18 PAPERS • 4 BENCHMARKS

ALCE (Automatic LLMs' Citation Evaluation)

ALCE is a benchmark for Automatic LLMs' Citation Evaluation. ALCE collects a diverse set of questions and retrieval corpora and requires building end-to-end systems to retrieve supporting evidence and generate answers with citations.

18 PAPERS • NO BENCHMARKS YET

CLOTH (CLOze test by TeacHers)

The Cloze Test by Teachers (CLOTH) benchmark is a collection of nearly 100,000 4-way multiple-choice cloze-style questions from middle- and high school-level English language exams, where the answer fills a blank in a given text. Each question is labeled with a type of deep reasoning it involves, where the four possible types are grammar, short-term reasoning, matching/paraphrasing, and long-term reasoning, i.e., reasoning over multiple sentences

18 PAPERS • NO BENCHMARKS YET

CVSS

CVSS is a massively multilingual-to-English speech to speech translation (S2ST) corpus, covering sentence-level parallel S2ST pairs from 21 languages into English. CVSS is derived from the Common Voice speech corpus and the CoVoST 2 speech-to-text translation (ST) corpus, by synthesizing the translation text from CoVoST 2 into speech using state-of-the-art TTS systems

18 PAPERS • 1 BENCHMARK

CliCR

CliCR is a new dataset for domain specific reading comprehension used to construct around 100,000 cloze queries from clinical case reports.

18 PAPERS • 1 BENCHMARK

Creative Writing

A creative writing task where the input is 4 random sentences and the output should be a coherent passage with 4 paragraphs that end in the 4 input sentences respectively. Such a task is open-ended and exploratory, and challenges creative thinking as well as high-level planning.

18 PAPERS • NO BENCHMARKS YET

EXTREME CLASSIFICATION

EXTREME CLASSIFICATION (Extreme Multi-label Classification)

The objective in extreme multi-label classification is to learn feature architectures and classifiers that can automatically tag a data point with the most relevant subset of labels from an extremely large label set. This repository provides resources that can be used for evaluating the performance of extreme multi-label algorithms including datasets, code, and metrics.

18 PAPERS • NO BENCHMARKS YET

FNC-1 (Fake News Challenge Stage 1)

FNC-1 was designed as a stance detection dataset and it contains 75,385 labeled headline and article pairs. The pairs are labelled as either agree, disagree, discuss, and unrelated. Each headline in the dataset is phrased as a statement

18 PAPERS • 2 BENCHMARKS

Datasets

2548 dataset results for Texts