🔔 Share your dataset with the ML community!

Filter by Modality (clear)

Filter by Task

Filter by Language

2575 dataset results for Texts

ProtoQA is a question answering dataset for training and evaluating common sense reasoning capabilities of artificial intelligence systems in such prototypical situations. The training set is gathered from an existing set of questions played in a long-running international game show FAMILY- FEUD. The hidden evaluation set is created by gathering answers for each question from 100 crowd-workers.

9 PAPERS • NO BENCHMARKS YET

ROSCOE

ROSCOE is a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics.

9 PAPERS • NO BENCHMARKS YET

Rainbow

Rainbow is multi-task benchmark for common-sense reasoning that uses different existing QA datasets: aNLI, Cosmos QA, HellaSWAG. Physical IQa, Social IQa, WinoGrande.

9 PAPERS • NO BENCHMARKS YET

SPARTQA (SPAtial Reasoning on Textual Question Answering)

SpartQA is a textual question answering benchmark for spatial reasoning on natural language text which contains more realistic spatial phenomena not covered by prior datasets and that is challenging for state-of-the-art language models (LM).

9 PAPERS • NO BENCHMARKS YET

SPARTQA - (SPAtial Reasoning on Textual Question Answering.)

We take advantage of the ground truth of NLVR images, design CFGs to generate stories, and use spatial reasoning rules to ask and answer spatial reasoning questions. This automatically generated data is called SpaRTQA. https://aclanthology.org/2021.naacl-main.364/

9 PAPERS • NO BENCHMARKS YET

SciTail

The SciTail dataset is an entailment dataset created from multiple-choice science exams and web sentences. Each question and the correct answer choice are converted into an assertive statement to form the hypothesis. We use information retrieval to obtain relevant text from a large text corpus of web sentences, and use these sentences as a premise P. We crowdsource the annotation of such premise-hypothesis pair as supports (entails) or not (neutral), in order to create the SciTail dataset. The dataset contains 27,026 examples with 10,101 examples with entails label and 16,925 examples with neutral label.

9 PAPERS • 1 BENCHMARK

SelQA

SelQA is a dataset that consists of questions generated through crowdsourcing and sentence length answers that are drawn from the ten most prevalent topics in the English Wikipedia.

9 PAPERS • NO BENCHMARKS YET

StepGame

A Benchmark for Robust Multi-Hop Spatial Reasoning in Texts

9 PAPERS • 1 BENCHMARK

TEMPO

TEMPO (Localizing Moments in Video with Temporal Language)

TEMPOral reasoning in video and language (TEMPO) is a dataset that consists of two parts: a dataset with real videos and template sentences (TEMPO - Template Language) which allows for controlled studies on temporal language, and a human language dataset which consists of temporal sentences annotated by humans (TEMPO - Human Language).

9 PAPERS • NO BENCHMARKS YET

UA-GEC

UA-GEC (UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language)

UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language

9 PAPERS • 1 BENCHMARK

VLEP (Video-and-Language Event Prediction)

VLEP contains 28,726 future event prediction examples (along with their rationales) from 10,234 diverse TV Show and YouTube Lifestyle Vlog video clips. Each example (see Figure 1) consists of a Premise Event (a short video clip with dialogue), a Premise Summary (a text summary of the premise event), and two potential natural language Future Events (along with Rationales) written by people. These clips are on average 6.1 seconds long and are harvested from diverse event-rich sources, i.e., TV show and YouTube Lifestyle Vlog videos.

9 PAPERS • 1 BENCHMARK

ViP-Bench (Making Large Multimodal Models Understand Arbitrary Visual Prompts)

ViP-Bench is a comprehensive benchmark designed to assess the capability of multimodal models in understanding visual prompts across multiple dimensions. It aims to evaluate how well these models interpret various visual prompts, including recognition, OCR, knowledge, math, relationship reasoning, and language generation. ViP-Bench includes a diverse set of 303 images and questions, providing a thorough assessment of visual understanding capabilities at the region level. This benchmark sets a foundation for future research into multimodal models with arbitrary visual prompts.

9 PAPERS • 1 BENCHMARK

VideoCC3M (Video-Conceptual-Captions)

We propose a new, scalable video-mining pipeline which transfers captioning supervision from image datasets to video and audio. We use this pipeline to mine paired video and captions, using the Conceptual Captions3M image dataset as a seed dataset. Our resulting dataset VideoCC3M consists of millions of weakly paired clips with text captions and will be released publicly.

9 PAPERS • NO BENCHMARKS YET

WiC-TSV

WiC-TSV (Words-in-Context: Target Sense Verification)

WiC-TSV is a new multi-domain evaluation benchmark for Word Sense Disambiguation. More specifically, it is a framework for Target Sense Verification of Words in Context which grounds its uniqueness in the formulation as a binary classification task thus being independent of external sense inventories, and the coverage of various domains. This makes the dataset highly flexible for the evaluation of a diverse set of models and systems in and across domains. WiC-TSV provides three different evaluation settings, depending on the input signals provided to the model.

9 PAPERS • 2 BENCHMARKS

WikiAtomicEdits

WikiAtomicEdits is a corpus of 43 million atomic edits across 8 languages. These edits are mined from Wikipedia edit history and consist of instances in which a human editor has inserted a single contiguous phrase into, or deleted a single contiguous phrase from, an existing sentence.

9 PAPERS • NO BENCHMARKS YET

WikiConv

A corpus that encompasses the complete history of conversations between contributors to Wikipedia, one of the largest online collaborative communities. By recording the intermediate states of conversations---including not only comments and replies, but also their modifications, deletions and restorations---this data offers an unprecedented view of online conversation.

9 PAPERS • NO BENCHMARKS YET

e-ViL

e-ViL is a benchmark for explainable vision-language tasks. e-ViL spans across three datasets of human-written NLEs (natural language explanations), and provides a unified evaluation framework that is designed to be re-usable for future works.

9 PAPERS • NO BENCHMARKS YET

nvBench

nvBench is a large-scale NL2VIS (natural languagge to visualisations) benchmark, containing 25,750 (NL, VIS) pairs from 750 tables over 105 domains, synthesized from (NL, SQL) benchmarks to support cross-domain NLPVIS (Natural Language Query to Visualization) task.

9 PAPERS • NO BENCHMARKS YET

unarXive

A scholarly data set with publications’ full-text, annotated in-text citations, and links to metadata.

9 PAPERS • NO BENCHMARKS YET

ABCD (Action-Based Conversations Dataset)

Action-Based Conversations Dataset (ABCD) is a goal-oriented dialogue fully-labeled dataset with over 10K human-to-human dialogues containing 55 distinct user intents requiring unique sequences of actions constrained by policies to achieve task success. The dataset is proposed to study customer service dialogue systems in more realistic settings.

8 PAPERS • 1 BENCHMARK

ArSentD-LEV

The Arabic Sentiment Twitter Dataset for the Levantine dialect (ArSenTD-LEV) is a dataset of 4,000 tweets with the following annotations: the overall sentiment of the tweet, the target to which the sentiment was expressed, how the sentiment was expressed, and the topic of the tweet.

8 PAPERS • NO BENCHMARKS YET

BB (Bacteria Biotope)

The Bacteria Biotope (BB) Task is part of the BioNLP Open Shared Tasks and meets the BioNLP-OST standards of quality, originality and data formats. Manually annotated data is provided for training, development and evaluation of information extraction methods. Tools for the detailed evaluation of system outputs are available. Support in performing linguistic processing are provided in the form of analyses created by various state-of-the art tools on the dataset texts.

8 PAPERS • 2 BENCHMARKS

BEAT2 (BEAT-SMPLX-FLAME)

We propose EMAGE, a framework to generate full-body human gestures from audio and masked gestures, encompassing facial, local body, hands, and global movements. To achieve this, we first introduce BEAT2 (BEAT-SMPLX-FLAME), a new mesh-level holistic co-speech dataset. BEAT2 combines MoShed SMPLX body with FLAME head parameters and further refines the modeling of head, neck, and finger movements, offering a community-standardized, high-quality 3D motion captured dataset. EMAGE leverages masked body gesture priors during training to boost inference performance. It involves a Masked Audio Gesture Transformer, facilitating joint training on audio-to-gesture generation and masked gesture reconstruction to effectively encode audio and body gesture hints. Encoded body hints from masked gestures are then separately employed to generate facial and body movements. Moreover, EMAGE adaptively merges speech features from the audio's rhythm and content and utilizes four compositional VQ-VAEs to enh

8 PAPERS • 2 BENCHMARKS

BIOMRC

A large-scale cloze-style biomedical MRC dataset. Care was taken to reduce noise, compared to the previous BIOREAD dataset of Pappas et al. (2018).

8 PAPERS • 1 BENCHMARK

BiToD

BiToD is a bilingual multi-domain dataset for end-to-end task-oriented dialogue modeling. BiToD contains over 7k multi-domain dialogues (144k utterances) with a large and realistic bilingual knowledge base. It serves as an effective benchmark for evaluating bilingual ToD systems and cross-lingual transfer learning approaches.

8 PAPERS • NO BENCHMARKS YET

Bongard-HOI

Bongard-HOI testifies to which extent your few-shot visual learner can quickly induce the true HOI concept from a handful of images and perform reasoning with it. Further, the learner is also expected to transfer the learned few-shot skills to novel HOI concepts compositionally.

8 PAPERS • 1 BENCHMARK

Business Scene Dialogue

The Japanese-English business conversation corpus, namely Business Scene Dialogue corpus, was constructed in 3 steps:

8 PAPERS • 2 BENCHMARKS

CAS-VSR-W1k (LRW-1000)

LRW-1000 has been renamed as CAS-VSR-W1k.* It is a naturally-distributed large-scale benchmark for word-level lipreading in the wild, including 1000 classes with about 718,018 video samples from more than 2000 individual speakers. There are more than 1,000,000 Chinese character instances in total. Each class corresponds to the syllables of a Mandarin word which is composed by one or several Chinese characters. This dataset aims to cover a natural variability over different speech modes and imaging conditions to incorporate challenges encountered in practical applications.

8 PAPERS • 1 BENCHMARK

CMB

CMB (Comprehensive Medical Benchmark in Chinese)

CMB is a comprehensive, multi-level Medical Benchmark in Chinese. It encompasses 280,839 multiple-choice questions and 74 complex case consultation questions, covering all clinical medical specialties and various professional levels. The platform aims to holistically evaluate a model's medical knowledge and clinical consultation capabilities.

8 PAPERS • NO BENCHMARKS YET

CMRC 2019

CMRC 2019 (Chinese Machine Reading Comprehension 2019)

CMRC 2019 is a Chinese Machine Reading Comprehension dataset that was used in The Third Evaluation Workshop on Chinese Machine Reading Comprehension. Specifically, CMRC 2019 is a sentence cloze-style machine reading comprehension dataset that aims to evaluate the sentence-level inference ability.

8 PAPERS • 3 BENCHMARKS

CMU Movie Summary Corpus

Dataset [46 M] and readme: 42,306 movie plot summaries extracted from Wikipedia + aligned metadata extracted from Freebase, including: Movie box office revenue, genre, release date, runtime, and language Character names and aligned information about the actors who portray them, including gender and estimated age at the time of the movie's release Supplement: Stanford CoreNLP-processed summaries [628 M]. All of the plot summaries from above, run through the Stanford CoreNLP pipeline (tagging, parsing, NER and coref).

8 PAPERS • NO BENCHMARKS YET

CMeEE

CMeEE (Chinese Medical Named Entity Recognition Dataset)

Chinese Medical Named Entity Recognition, a dataset first released in CHIP20204, is used for CMeEE task. Given a pre-defined schema, the task is to identify and extract entities from the given sentence and classify them into nine categories: disease, clinical manifestations, drugs, medical equipment, medical procedures, body, medical examinations, microorganisms, and department.

8 PAPERS • 1 BENCHMARK

CaSiNo

CaSiNo is a dataset of 1030 negotiation dialogues in English. To create the dataset, two participates take the role of campsite neighbors and negotiate for Food, Water, and Firewood packages, based on their individual preferences and requirements. This design keeps the task tractable, while still facilitating linguistically rich and personal conversations.

8 PAPERS • NO BENCHMARKS YET

Chart2Text (Chart Summarization Dataset)

Chart2Text is a dataset that was crawled from 23,382 freely accessible pages from statista.com in early March of 2020, yielding a total of 8,305 charts, and associated summaries. For each chart, the chart image, the underlying data table, the title, the axis labels, and a human-written summary describing the statistic was downloaded.

8 PAPERS • NO BENCHMARKS YET

CoNLL-2000

CoNLL-2000 is a dataset for dividing text into syntactically related non-overlapping groups of words, so-called text chunking.

8 PAPERS • 2 BENCHMARKS

Com2Sense (Complementary Commonsense)

Complementary Commonsense (Com2Sense) is a dataset for benchmarking commonsense reasoning ability of NLP models. This dataset contains 4k statement true/false sentence pairs. The dataset is crowdsourced and enhanced with an adversarial model-in-the-loop setup to incentivize challenging samples. To facilitate a systematic analysis of commonsense capabilities, the dataset is designed along the dimensions of knowledge domains, reasoning scenarios and numeracy.

8 PAPERS • NO BENCHMARKS YET

DL-HARD (Deep Learning Hard)

Deep Learning Hard (DL-HARD) is an annotated dataset designed to more effectively evaluate neural ranking models on complex topics. It builds on TREC Deep Learning (DL) questions extensively annotated with query intent categories, answer types, wikified entities, topic categories, and result type metadata from a leading web search engine.

8 PAPERS • NO BENCHMARKS YET

Description Detection Dataset

Description Detection Dataset ($D^3$, /dikju:b/) is an attempt at creating a next-generation object detection dataset. Unlike traditional detection datasets, the class names of the objects are no longer simple nouns or noun phrases, but rather complex and descriptive, such as a dog not being held by a leash. For each image in the dataset, any object that matches the description is annotated. The dataset provides annotations such as bounding boxes and finely crafted instance masks.It comprises of 422 well-designed descriptions and 24,282 positive object-description pairs.

8 PAPERS • 1 BENCHMARK

DocNLI

DocNLI is a large-scale dataset for document-level NLI. DocNLI is transformed from a broad range of NLP problems and covers multiple genres of text. The premises always stay in the document granularity, whereas the hypotheses vary in length from single sentences to passages with hundreds of words. Additionally, DocNLI has pretty limited artifacts which unfortunately widely exist in some popular sentence-level NLI datasets.

8 PAPERS • NO BENCHMARKS YET

E-KAR (Benchmark for Explainable Knowledge-intensive Analogical Reasoning)

The ability to recognize analogies is fundamental to human cognition. Existing benchmarks to test word analogy do not reveal the underneath process of analogical reasoning of neural models.

8 PAPERS • NO BENCHMARKS YET

EmoWOZ

EmoWOZ is the first large-scale open-source dataset for emotion recognition in task-oriented dialogues. It contains emotion annotations for user utterances in the entire MultiWOZ (10k+ human-human dialogues) and DialMAGE (1k human-machine dialogues collected from our human trial). Overall, there are 83k user utterances annotated. In addition, the emotion annotation scheme is tailored to task-oriented dialogues and considers the valence, the elicitor, and the conduct of the user emotion.

8 PAPERS • 1 BENCHMARK

GUM (Georgetown University Multilayer corpus)

GUM is an open source multilayer English corpus of richly annotated texts from twelve text types. Annotations include:

8 PAPERS • 1 BENCHMARK

GeneCIS

GeneCIS benchmark is designed for measuring models’ ability to adapt to a range of similarity conditions, which is zero-shot evaluation only.

8 PAPERS • NO BENCHMARKS YET

Global Voices

Global Voices is a multilingual dataset for evaluating cross-lingual summarization methods. It is extracted from social-network descriptions of Global Voices news articles to cheaply collect evaluation data for into-English and from-English summarization in 15 languages.

8 PAPERS • NO BENCHMARKS YET

HOList

The official HOList benchmark for automated theorem proving consists of all theorem statements in the core, complex, and flyspeck corpora. The goal of the benchmark is to prove as many theorems as possible in the HOList environment in the order they appear in the database. That is, only theorems that occur before the current theorem are supposed to be used as premises (lemmata) in its proof.

8 PAPERS • 1 BENCHMARK

KnowIT VQA

KnowIT VQA is a video dataset with 24,282 human-generated question-answer pairs about The Big Bang Theory. The dataset combines visual, textual and temporal coherence reasoning together with knowledge-based questions, which need of the experience obtained from the viewing of the series to be answered.

8 PAPERS • 1 BENCHMARK

Lani

LANI is a 3D navigation environment and corpus, where an agent navigates between landmarks. Lani contains 27,965 crowd-sourced instructions for navigation in an open environment. Each datapoint includes an instruction, a human-annotated ground-truth demonstration trajectory, and an environment with various landmarks and lakes. The dataset train/dev/test split is 19,758/4,135/4,072. Each environment specification defines placement of 6–13 landmarks within a square grass field of size 50m×50m.

8 PAPERS • NO BENCHMARKS YET

Logic2Text

Logic2Text is a large-scale dataset with 10,753 descriptions involving common logic types paired with the underlying logical forms. The logical forms show diversified graph structure of free schema, which poses great challenges on the model's ability to understand the semantics.

8 PAPERS • NO BENCHMARKS YET

Datasets

2575 dataset results for Texts