🔔 Share your dataset with the ML community!

Filter by Modality

Filter by Task

Filter by Language (clear)

2900 dataset results for English

Project CodeNet is a large-scale dataset with approximately 14 million code samples, each of which is an intended solution to one of 4000 coding problems. The code samples are written in over 50 programming languages (although the dominant languages are C++, C, Python, and Java) and they are annotated with a rich set of information, such as its code size, memory footprint, cpu run time, and status, which indicates acceptance or error types. The dataset is accompanied by a repository, where we provide a set of tools to aggregate codes samples based on user criteria and to transform code samples into token sequences, simplified parse trees and other code graphs. A detailed discussion of Project CodeNet is available in this paper.

11 PAPERS • NO BENCHMARKS YET

QAMPARI

QAMPARI is an ODQA benchmark, where question answers are lists of entities, spread across many paragraphs. It was created by (a) generating questions with multiple answers from Wikipedia's knowledge graph and tables, (b) automatically pairing answers with supporting evidence in Wikipedia paragraphs, and (c) manually paraphrasing questions and validating each answer.

11 PAPERS • NO BENCHMARKS YET

ReCAM (SemEval-2021 Task 4: Reading Comprehension of Abstract Meaning)

Tasks Our shared task has three subtasks. Subtask 1 and 2 focus on evaluating machine learning models' performance with regard to two definitions of abstractness (Spreen and Schulz, 1966; Changizi, 2008), which we call imperceptibility and nonspecificity, respectively. Subtask 3 aims to provide some insights to their relationships.

11 PAPERS • 1 BENCHMARK

Real HSI

Real HSI (End-to-End Low Cost Compressive Spectral Imaging with Spatial-Spectral Self-Attention)

End-to-End Low Cost Compressive Spectral Imaging with Spatial-Spectral Self-Attention

11 PAPERS • 1 BENCHMARK

SMAC-Exp (StarCraft Multi-Agent Exploration Challenge)

The StarCraft Multi-Agent Challenges+ requires agents to learn completion of multi-stage tasks and usage of environmental factors without precise reward functions. The previous challenges (SMAC) recognized as a standard benchmark of Multi-Agent Reinforcement Learning are mainly concerned with ensuring that all agents cooperatively eliminate approaching adversaries only through fine manipulation with obvious reward functions. This challenge, on the other hand, is interested in the exploration capability of MARL algorithms to efficiently learn implicit multi-stage tasks and environmental factors as well as micro-control. This study covers both offensive and defensive scenarios. In the offensive scenarios, agents must learn to first find opponents and then eliminate them. The defensive scenarios require agents to use topographic features. For example, agents need to position themselves behind protective structures to make it harder for enemies to attack.

11 PAPERS • 13 BENCHMARKS

SODA

SODA is a high-quality social dialogue dataset. In contrast to most existing crowdsourced, small-scale dialogue corpora, Soda distills 1.5M socially-grounded dialogues from a pre-trained language model (InstructGPT; Ouyang et al., ). Dialogues are distilled by contextualizing social commonsense knowledge from a knowledge graph (Atomic10x).

11 PAPERS • NO BENCHMARKS YET

StylePTB

StylePTB is a fine-grained text style transfer benchmark. It consists of paired sentences undergoing 21 fine-grained stylistic changes spanning atomic lexical, syntactic, semantic, and thematic transfers of text, as well as compositions of multiple transfers which allow modelling of fine-grained stylistic changes as building blocks for more complex, high-level transfers.

11 PAPERS • NO BENCHMARKS YET

Synbols

Synbols is a dataset generator designed for probing the behavior of learning algorithms. By defining the distribution over latent factors one can craft a dataset specifically tailored to answer specific questions about a given algorithm.

11 PAPERS • NO BENCHMARKS YET

TRIPOD (TuRnIng POint Dataset)

TRIPOD contains screenplays and plot synopses with turning point (TP) annotations for 99 movies. Each movie contains:

11 PAPERS • NO BENCHMARKS YET

Talk the Walk

Talk The Walk is a large-scale dialogue dataset grounded in action and perception. The task involves two agents (a “guide” and a “tourist”) that communicate via natural language in order to achieve a common goal: having the tourist navigate to a given target location.

11 PAPERS • NO BENCHMARKS YET

TaxiNLI

TaxiNLI is a dataset collected based on the principles and categorizations of the aforementioned taxonomy. A subset of examples are curated from MultiNLI (Williams et al., 2018) by sampling uniformly based on the entailment label and the domain. The dataset is annotated with finegrained category labels.

11 PAPERS • NO BENCHMARKS YET

TimeDial

TimeDial presents a crowdsourced English challenge set, for temporal commonsense reasoning, formulated as a multiple choice cloze task with around 1.5k carefully curated dialogs. The dataset is derived from the DailyDialog, which is a multi-turn dialog corpus.

11 PAPERS • NO BENCHMARKS YET

UFPR-ALPR

This dataset includes 4,500 fully annotated images (over 30,000 license plate characters) from 150 vehicles in real-world scenarios where both the vehicle and the camera (inside another vehicle) are moving.

11 PAPERS • 1 BENCHMARK

VGMIDI

VGMIDI is a dataset of piano arrangements of video game soundtracks. It contains 200 MIDI pieces labeled according to emotion and 3,850 unlabeled pieces. Each labeled piece was annotated by 30 human subjects according to the Circumplex (valence-arousal) model of emotion using a custom web tool.

11 PAPERS • NO BENCHMARKS YET

ViGGO

The ViGGO corpus is a set of 6,900 meaning representation to natural language utterance pairs in the video game domain. The meaning representations are of 9 different dialogue acts.

11 PAPERS • 1 BENCHMARK

VoxForge

VoxForge is an open speech dataset that was set up to collect transcribed speech for use with Free and Open Source Speech Recognition Engines (on Linux, Windows and Mac). Image Source: http://www.voxforge.org/home

11 PAPERS • 9 BENCHMARKS

X-CSQA

X-CSQA is a multilingual dataset for Commonsense reasoning research, based on CSQA.

11 PAPERS • NO BENCHMARKS YET

BCI (Breast Cancer Immunohistochemical Image Generation)

The evaluation of human epidermal growth factor receptor 2 (HER2) expression is essential to formulate a precise treatment for breast cancer. The routine evaluation of HER2 is conducted with immunohistochemical techniques (IHC), which is very expensive. Therefore, we propose a breast cancer immunohistochemical (BCI) benchmark attempting to synthesize IHC data directly with the paired hematoxylin and eosin (HE) stained images. The dataset contains 4870 registered image pairs, covering a variety of HER2 expression levels (0, 1+, 2+, 3+).

10 PAPERS • 1 BENCHMARK

BiRD (Bigram Relatedness Dataset)

Bigram Relatedness Dataset (BiRD) is a large, fine-grained, bigram relatedness dataset, using a comparative annotation technique called Best Worst Scaling. Each of BiRD's 3,345 English term pairs involves at least one bigram. BiRD is made freely available to foster further research on how meaning can be represented and how meaning can be composed.

10 PAPERS • NO BENCHMARKS YET

ConditionalQA

ConditionalQA is a Question Answering (QA) dataset that contains complex questions with conditional answers, i.e. the answers are only applicable when certain conditions apply.

10 PAPERS • 1 BENCHMARK

DDPM

DDPM (Deception Detection and Physiological Monitoring)

The Deception Detection and Physiological Monitoring (DDPM) dataset captures an interview scenario in which the interviewee attempts to deceive the interviewer on selected responses. The interviewee is recorded in RGB, near-infrared, and long-wave infrared, along with cardiac pulse, blood oxygenation, and audio. After collection, data were annotated for interviewer/interviewee, curated, ground-truthed, and organized into train/test parts for a set of canonical deception detection experiments. The dataset contains almost 13 hours of recordings of 70 subjects, and over 8 million visible-light, near-infrared, and thermal video frames, along with appropriate meta, audio, and pulse oximeter data.

10 PAPERS • NO BENCHMARKS YET

Duolingo STAPLE Shared Task

This is the dataset for the 2020 Duolingo shared task on Simultaneous Translation And Paraphrase for Language Education (STAPLE). Sentence prompts, along with automatic translations, and high-coverage sets of translation paraphrases weighted by user response are provided in 5 language pairs. Starter code for this task can be found here: github.com/duolingo/duolingo-sharedtask-2020/. More details on the data set and task are available at: sharedtask.duolingo.com

10 PAPERS • NO BENCHMARKS YET

Ekman6

the YF-E6 emotion dataset using the 6 basic emotion type as keywords on social video-sharing websites including YouTube and Flickr, leading to a total of 3000 videos. The dataset is labeled through crowdsourcing by 10 different annotators (5 males and 5 females), whose age ranged from 22 to 45. Annotators were given detailed definition for each emotion before performing the task. Every video is manually labeled by all the annotators. A video is excluded from the final dataset when over half of annotations are inconsistent with the initial search keyword.

10 PAPERS • 1 BENCHMARK

FM-IQA (Freestyle Multilingual Image Question Answering)

FM-IQA is a question-answering dataset containing over 150,000 images and 310,000 freestyle Chinese question-answer pairs and their English translations.

10 PAPERS • NO BENCHMARKS YET

Fig-QA

Fig-QA consists of 10256 examples of human-written creative metaphors that are paired as a Winograd schema. It can be used to evaluate the commonsense reasoning of models. The metaphors themselves can also be used as training data for other tasks, such as metaphor detection or generation.

10 PAPERS • NO BENCHMARKS YET

Forest CoverType

Predicting forest cover type from cartographic variables only (no remotely sensed data). The actual forest cover type for a given observation (30 x 30 meter cell) was determined from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. Independent variables were derived from data originally obtained from US Geological Survey (USGS) and USFS data. Data is in raw form (not scaled) and contains binary (0 or 1) columns of data for qualitative independent variables (wilderness areas and soil types).

10 PAPERS • 1 BENCHMARK

GMOT-40

GMOT-40 (Generic Multiple Object Tracking (GMOT))

GMOT-40 is the first public dense dataset for Generic Multiple Object Tracking (GMOT). It contains 40 carefully annotated sequences evenly distributed among 10 object categories. Beyond the data, a challenging protocal, one-shot GMOT, is adopted and a series of baseline algorithms is introduced. GMOT-40 is featured in

10 PAPERS • NO BENCHMARKS YET

HINT3

HINT3 is a dataset for intent detection. It consists of 3 different datasets each containing a diverse set of intents in a single domain - mattress products retail, fitness supplements retail and online gaming named SOFMattress, Curekart and Powerplay11.

10 PAPERS • NO BENCHMARKS YET

LectureBank

LectureBank Dataset is a manually collected dataset of lecture slides. It contains 1,352 online lecture files from 60 courses covering 5 different domains, including Natural Language Processing (nlp), Machine Learning (ml), Artificial Intelligence (ai), Deep Learning (dl) and Information Retrieval (ir). In addition, it also contains the corresponding annotations for each slide.

10 PAPERS • NO BENCHMARKS YET

MCubeS

MCubeS (Multimodal Material Segmentation Dataset)

Multimodal material segmentation (MCubeS) dataset contains 500 sets of images from 42 street scenes. Each scene has images for four modalities: RGB, angle of linear polarization (AoLP), degree of linear polarization (DoLP), and near-infrared (NIR). The dataset provides annotated ground truth labels for both material and semantic segmentation for every pixel. The dataset is divided training set with 302 image sets, validation set with 96 image sets, and test set with 102 image sets. Each image has 1224 x 1024 pixels and a total of 20 class labels per pixel.

10 PAPERS • 1 BENCHMARK

MINTAKA

MINTAKA is a complex, natural, and multilingual dataset designed for experimenting with end-to-end question-answering models. It is composed of 20,000 question-answer pairs collected in English, annotated with Wikidata entities, and translated into Arabic, French, German, Hindi, Italian, Japanese, Portuguese, and Spanish for a total of 180,000 samples. Mintaka includes 8 types of complex questions, including superlative, intersection, and multi-hop questions, which were naturally elicited from crowd workers.

10 PAPERS • NO BENCHMARKS YET

MultiEURLEX

MultiEURLEX is a multilingual dataset for topic classification of legal documents. The dataset comprises 65k European Union (EU) laws, officially translated in 23 languages, annotated with multiple labels from the EUROVOC taxonomy. The dataset covers 23 official EU languages from 7 language families.

10 PAPERS • NO BENCHMARKS YET

OCW (Only Connect Wall Dataset and creative problem solving tasks)

The OCW dataset is for evaluating creative problem solving tasks by curating the problems and human performance results from the popular British quiz show Only Connect.

10 PAPERS • 1 BENCHMARK

OVEN (Open-domain Visual Entity Recognition)

In this project, we formally present the task of Open-domain Visual Entity recognitioN (OVEN), where a model need to link an image onto a Wikipedia entity with respect to a text query. We construct OVEN-Wiki by re-purposing 14 existing datasets with all labels grounded onto one single label space: Wikipedia entities. OVEN challenges models to select among six million possible Wikipedia entities, making it a general visual recognition benchmark with the largest number of labels.

10 PAPERS • 1 BENCHMARK

Overruling

The Overruling dataset is a law dataset corresponding to the task of determining when a sentence is overruling a prior decision. This is a binary classification task, where positive examples are overruling sentences and negative examples are non-overruling sentences extracted from legal opinions. In law, an overruling sentence is a statement that nullifies a previous case decision as a precedent, by a constitutionally valid statute or a decision by the same or higher ranking court which establishes a different rule on the point of law involved. The Overruling dataset consists of 2,400 sentences.

10 PAPERS • 1 BENCHMARK

PartialSpoof

PartialSpoof is a dataset of partially-spoofed data to evaluate detection of partially-spoofed speech data. It has been built based on the ASVspoof 2019 LA database since the latter covers 17 types of spoofed data produced by advanced speech synthesizers, voice converters, and hybrids. The authors used the same set of bona fide data from the ASVspoof 2019 LA database but created partially spoofed audio from the ASVspoof 2019 LA data.

10 PAPERS • NO BENCHMARKS YET

PhotoBook

A large-scale collection of visually-grounded, task-oriented dialogues in English designed to investigate shared dialogue history accumulating during conversation.

10 PAPERS • NO BENCHMARKS YET

ReQA (Retrieval Question-Answering)

Retrieval Question-Answering (ReQA) benchmark tests a model’s ability to retrieve relevant answers efficiently from a large set of documents.

10 PAPERS • NO BENCHMARKS YET

SCICAP

SCICAP is a large-scale image captioning dataset that contains real-world scientific figures and captions. SCICAP was constructed using more than two million images from over 290,000 papers collected and released by arXiv.

10 PAPERS • 1 BENCHMARK

SinD (A Drone Dataset at Signalized Intersection in China)

The SIND dataset is based on 4K video captured by drones, providing information including traffic participant trajectories, traffic light status, and high-definition maps

10 PAPERS • NO BENCHMARKS YET

TED Gesture Dataset

Co-speech gestures are everywhere. People make gestures when they chat with others, give a public speech, talk on a phone, and even think aloud. Despite this ubiquity, there are not many datasets available. The main reason is that it is expensive to recruit actors/actresses and track precise body motions. There are a few datasets available (e.g., MSP AVATAR [17] and Personality Dyads Corpus [18]), but their sizes are limited to less than 3 h, and they lack diversity in speech content and speakers. The gestures also could be unnatural owing to inconvenient body tracking suits and acting in a lab environment.

10 PAPERS • 1 BENCHMARK

TekGen

The Dataset is part of the KELM corpus

10 PAPERS • 1 BENCHMARK

TheoremQA

We propose the first question-answering dataset driven by STEM theorems. We annotated 800 QA pairs covering 350+ theorems spanning across Math, EE&CS, Physics and Finance. The dataset is collected by human experts with very high quality. We provide the dataset as a new benchmark to test the limit of large language models to apply theorems to solve challenging university-level questions. We provide a pipeline in the following to prompt LLMs and evaluate their outputs with WolframAlpha.

10 PAPERS • 1 BENCHMARK

Yelp-Fraud (Multi-relational Graph Dataset for Yelp Spam Review Detection)

Yelp-Fraud is a multi-relational graph dataset built upon the Yelp spam review dataset, which can be used in evaluating graph-based node classification, fraud detection, and anomaly detection models.

10 PAPERS • 2 BENCHMARKS

arXiv Summarization Dataset

This is a dataset for evaluating summarisation methods for research papers.

10 PAPERS • 3 BENCHMARKS

2012 i2b2 Temporal Relations

2012 i2b2 Temporal Relations (2012 i2b2 Temporal Relations Corpus)

The Sixth Informatics for Integrating Biology and the Bedside (i2b2) Natural Language Processing Challenge for Clinical Records focused on the temporal relations in clinical narratives. The organizers provided the research community with a corpus of discharge summaries annotated with temporal information, to be used for the development and evaluation of temporal reasoning systems. 18 teams from around the world participated in the challenge. During the workshop, participating teams presented comprehensive reviews and analysis of their systems, and outlined future research directions suggested by the challenge contributions.

9 PAPERS • 2 BENCHMARKS

28 Ghz wireless channel dataset

Our dataset which consists of multiple indoor and outdoor experiments for up to 30 m gNB-UE link. In each experiment, we fixed the location of the gNB and move the UE with an increment of roughly one degrees. The table above specifies the direction of user movement with respect to gNB-UE link, distance resolution, and the number of user locations for which we conduct channel measurements. Outdoor 30 m data also contains blockage between 3.9 m to 4.8 m. At each location, we scan the transmission beam and collect data for each beam. By doing so, we can get the full OFDM channels for different locations along the moving trajectory with all the beam angles. Moreover, we use 240 kHz subcarrier spacing, which is consistent with the 5G NR numerology at FR2, so the data we collect will be a true reflection of what a 5G UE will see.

9 PAPERS • NO BENCHMARKS YET

ARCTIC

ARCTIC (Articulated Objects in Free-form Hand Interaction)

ARCTIC is a dataset of free-form interactions of hands and articulated objects. ARCTIC has 1.2M images paired with accurate 3D meshes for both hands and for objects that move and deform over time. The dataset also provides hand-object contact information.

9 PAPERS • NO BENCHMARKS YET

Datasets

2900 dataset results for English