22 dataset results for Classification AND Texts AND English

The Schema-Guided Dialogue (SGD) dataset consists of over 20k annotated multi-domain, task-oriented conversations between a human and a virtual assistant. These conversations involve interactions with services and APIs spanning 20 domains, ranging from banks and events to media, calendar, travel, and weather. For most of these domains, the dataset contains multiple different APIs, many of which have overlapping functionalities but different interfaces, which reflects common real-world scenarios. The wide range of available annotations can be used for intent prediction, slot filling, dialogue state tracking, policy imitation learning, language generation, and user simulation learning, among other tasks in large-scale virtual assistants. In addition, the evaluation set contains unseen domains and services, which makes it possible to quantify performance in zero-shot or few-shot settings.

169 PAPERS • 3 BENCHMARKS

ALTA 2021 Shared Task

ALTA 2021 Shared Task (Automatic Grading of Evidence, 10 years later)

This dataset is described in the ALTA 2021 Shared Task website and associated CodaLab competition.

4 PAPERS • NO BENCHMARKS YET

Tasksource

Huggingface Datasets is a great library, but it lacks standardization, and datasets require preprocessing work to be used interchangeably. tasksource automates this and facilitates reproducible multi-task learning scaling.

3 PAPERS • NO BENCHMARKS YET

Xhate999

We present XHate-999, a multi-domain and multilingual evaluation data set for abusive language detection. By aligning test instances across six typologically diverse languages, XHate-999 for the first time allows for disentanglement of the domain transfer and language transfer effects in abusive language detection. We conduct a series of domain- and language-transfer experiments with state-of-the-art monolingual and multilingual transformer models, setting strong baseline results and profiling XHate-999 as a comprehensive evaluation resource for abusive language detection. Finally, we show that domain- and language-adaption, via intermediate masked language modeling on abusive corpora in the target language, can lead to substantially improved abusive language detection in the target language in the zero-shot transfer setups.

3 PAPERS • NO BENCHMARKS YET

Dermatology ddx dataset

The dermatology differential diagnoses (ddx) dataset for skin condition classification includes expert annotations and model predictions for 1947 cases. Note that no images or meta information are provided. The expert annotations come in the form of differential diagnoses, i.e., partial rankings of conditions, and there is a high level of disagreement among experts, making this a perfect benchmark for dealing with disagreement. The data has been introduced in [1] and [2].

2 PAPERS • NO BENCHMARKS YET

IRFL: Image Recognition of Figurative Language

The IRFL dataset consists of idioms, similes, and metaphors with matching figurative and literal images, as well as two novel tasks of multimodal figurative understanding and preference.

2 PAPERS • 2 BENCHMARKS

MedSecId

The process by which sections in a document are demarcated and labeled is known as section identification. Such sections are helpful to the reader when searching for information and contextualizing specific topics. The goal of this work is to segment the sections of clinical medical domain documentation. The primary contribution of this work is MedSecId, a publicly available set of 2,002 fully annotated medical notes from the MIMIC-III. We include several baselines, source code, a pretrained model and analysis of the data showing a relationship between medical concepts across sections using principal component analysis.

2 PAPERS • 2 BENCHMARKS

CVE (Common Vulnerabilities and Exposures)

CVE stands for Common Vulnerabilities and Exposures. CVE is a glossary that classifies vulnerabilities. The glossary analyzes vulnerabilities and then uses the Common Vulnerability Scoring System (CVSS) to evaluate the threat level of a vulnerability. The resulting score is often used to prioritize which vulnerabilities to address.

1 PAPER • NO BENCHMARKS YET

Cards Against Humanity

A dataset of games played in the card game "Cards Against Humanity" (CAH), by human players, derived from the online CAH labs. Each round includes the cards presented to users - a "black" prompt with a blank or question and 10 "white" punchlines as possible responses, and which punchline was picked by a player each round, along with text and metadata.

1 PAPER • NO BENCHMARKS YET

Colors

A large dataset of color names and their respective RGB values, stored in CSV format.

1 PAPER • 1 BENCHMARK

DeepParliament

DeepParliament is a legal-domain benchmark dataset that gathers bill documents and metadata and supports various bill status classification tasks. The dataset text covers a broad range of bills from 1986 to the present and contains richer information on parliament bill content. There are 5329 documents in total, of which 4223 are in the training set and 1106 are in the test set. In both sets, each bill document contains many sentences, and document length varies greatly.

1 PAPER • NO BENCHMARKS YET

Dissonance Twitter Dataset

Dissonance Twitter Dataset is a dataset collected from annotating tweets for dissonance.

1 PAPER • NO BENCHMARKS YET

FinBench

FinBench is a benchmark for evaluating the performance of machine learning models with both tabular data inputs and profile text inputs.

1 PAPER • NO BENCHMARKS YET

Food Recall Incidents Dataset

The Food Recall Incidents dataset consists of 7,546 short texts (from 5 to 360 characters each), which are the titles of food recall announcements (therefore referred to as title), crawled from 24 public food safety authority websites by Agroknow. The texts are written in 6 languages, with English (6,644) and German (888) being the most common, followed by French (8), Greek (4), Italian (1) and Danish (1). Most of the texts have been authored after 2010 and they describe recalls of specific food products due to specific hazards. Experts manually classified each text to four groups of classes describing hazards and products on two levels of granularity:

1 PAPER • NO BENCHMARKS YET

MiST

MiST (Modals In Scientific Text) is a dataset containing 3737 modal instances in five scientific domains annotated for their semantic, pragmatic, or rhetorical function.

1 PAPER • NO BENCHMARKS YET

RGZ EMU: Semantic Taxonomy

RGZ EMU: Semantic Taxonomy (Radio Galaxy Zoo EMU: Towards a Semantic Radio Galaxy Morphology Taxonomy)

The data used in "Radio Galaxy Zoo EMU: Towards a Semantic Radio Galaxy Morphology Taxonomy" (Bowles et al., submitted) and "A New Task: Deriving Semantic Class Targets for the Physical Sciences" (Bowles et al. 2022: https://arxiv.org/abs/2210.14760), accepted at the Fifth Workshop on Machine Learning and the Physical Sciences, Neural Information Processing Systems 2022.

1 PAPER • NO BENCHMARKS YET

Reddit Ideology Database

A dataset of news articles posted in the r/Liberal and r/Conservative subreddits. In total, we collected a corpus of 226,010 articles in order to understand political expression through the news that users share.

1 PAPER • 1 BENCHMARK

SHADR

SHADR (Synthetic SDoH Human Annotated Demographic Robustness dataset)

SDoH Human Annotated Demographic Robustness (SHADR) Dataset Overview: Social determinants of health (SDoH) play a pivotal role in determining patient outcomes. However, their documentation in electronic health records (EHR) remains incomplete. This dataset was created from a study examining the capability of large language models in extracting SDoH from the free text sections of EHRs. Furthermore, the study delved into the potential of synthetic clinical text to bolster the extraction process of these scarcely documented, yet crucial, clinical data.

1 PAPER • NO BENCHMARKS YET

ALFI (Annotations for Label-Free Images)

ALFI (Annotations for Label-Free Images) is a dataset of images and annotations for label-free microscopy imaging. It consists of 29 time-lapse image sequences with various annotations (pixel-wise segmentation masks, object-wise bounding boxes, and tracking information), made publicly available to the scientific community through figshare.

0 PAPER • NO BENCHMARKS YET

ALTA 2022 Shared Task

ALTA 2022 Shared Task (PIBOSO Sentence classification)

This dataset is described in the ALTA 2022 Shared Task and associated CodaLab competition.

0 PAPER • NO BENCHMARKS YET

ALTA 2023 Shared Task

ALTA 2023 Shared Task (Discriminate between human-authored and synthetic text generated by Large Language Models (LLMs))

This dataset is described in the ALTA 2023 Shared Task and associated CodaLab competition.

0 PAPER • NO BENCHMARKS YET

Big-Five Backstage

The dataset consists of 3265 text samples, each corresponding to the concatenation of lines spoken by a fictional character. Texts are extracted from 400 theatre plays written by 132 different authors. Overall, the dataset contains 3419136 words, with a mean of 1047.2 words per character. Text entries have binary labels representing the gender of a character (Male or Female) and their five personality traits (Extraversion, Agreeableness, Openness, Neuroticism, Conscientiousness). The auxiliary part of the dataset includes author-level labels reflecting their gender, country of origin, and years of life.

0 PAPER • NO BENCHMARKS YET