We introduce a new dataset, called FoCus, which supports knowledge-grounded answers that reflect a user's persona. One situation in which people need different types of knowledge, depending on their preferences, is when they travel around the world.
14 PAPERS • NO BENCHMARKS YET
FoodSeg103 is a new food image dataset containing 7,118 images. Images are annotated with 104 ingredient classes, and each image has an average of 6 ingredient labels with pixel-wise masks. It is provided as a large-scale benchmark for food image segmentation.
14 PAPERS • 1 BENCHMARK
FreebaseQA is a data set for open-domain QA over the Freebase knowledge graph. The question-answer pairs in this data set are collected from various sources, including the TriviaQA data set and other trivia websites (QuizBalls, QuizZone, KnowQuiz), and are matched against Freebase to generate relevant subject-predicate-object triples that were further verified by human annotators. As all questions in FreebaseQA are composed independently for human contestants in various trivia-like competitions, this data set shows richer linguistic variation and complexity than existing QA data sets, making it a good test-bed for emerging KB-QA systems.
GeoS is a dataset for automatic math problem solving. It is a dataset of SAT plane geometry questions where every question has a textual description in English accompanied by a diagram and multiple choices. Questions and answers are compiled from previous official SAT exams and practice exams offered by the College Board. We annotate ground-truth logical forms for all questions in the dataset.
Humicroedit is a humorous headline dataset. The data consists of regular English news headlines paired with versions of the same headlines that contain simple replacement edits designed to make them funny. The authors carefully curated crowdsourced editors to create funny headlines and judges to score them, yielding a total of 15,095 edited headlines with five judges per headline.
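As an illustration, a replacement edit of this kind can be applied by swapping the marked span of the original headline for the edit word. The tagged `<word/>` span format and the example headline below are assumptions for the sketch, not necessarily the dataset's exact serialization:

```python
import re

def apply_edit(original: str, replacement: str) -> str:
    """Replace the tagged <word/> span in a headline with the edit word."""
    return re.sub(r"<[^/>]+/>", replacement, original, count=1)

# Hypothetical example headline and edit word:
print(apply_edit("Police <arrest/> man in connection with robbery", "tickle"))
# → Police tickle man in connection with robbery
```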
The Indian Diabetic Retinopathy Image Dataset (IDRiD) consists of typical diabetic retinopathy lesions and normal retinal structures annotated at a pixel level. The dataset also provides information on the disease severity of diabetic retinopathy and diabetic macular edema for each image, making it well suited to the development and evaluation of image analysis algorithms for early detection of diabetic retinopathy.
14 PAPERS • 3 BENCHMARKS
The LIVECell (Label-free In Vitro image Examples of Cells) dataset is a large-scale microscopic image dataset for instance-segmentation of individual cells in 2D cell cultures.
OpenAssistant Conversations is a human-feedback dataset with between 100K and 1M examples, released under the Apache-2.0 license.
The goal of PubTables-1M is to create a large, detailed, high-quality dataset for training and evaluating a wide variety of models for the tasks of table detection, table structure recognition, and functional analysis. It contains:
SQA3D is a dataset for embodied scene understanding, where an agent needs to comprehend the scene it is situated in from a first-person perspective and answer questions. The questions are designed to be situated, embodied, and knowledge-intensive. We offer three different modalities to represent a 3D scene: 3D scan, egocentric video, and BEV picture.
14 PAPERS • 2 BENCHMARKS
Capturing present knowledge from surrounding situations and reasoning over it accordingly is crucial and challenging for machine intelligence. The STAR Benchmark is a novel benchmark for Situated Reasoning, which provides 60K challenging situated questions across four types of tasks, 140K situated hypergraphs, symbolic situation programs, and logic-grounded diagnosis for real-world video situations.
Node classification on Texas with the fixed 48%/32%/20% train/validation/test splits provided by Geom-GCN.
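For illustration, a 48%/32%/20% node split can be sketched as below. This is an assumption-laden sketch: the actual Geom-GCN splits ship as fixed mask files, and the node count of 183 for Texas is assumed here:

```python
import random

def fixed_split(num_nodes: int, seed: int = 0):
    """Illustrative 48%/32%/20% train/val/test node split.
    The real Geom-GCN splits are distributed as fixed mask files."""
    idx = list(range(num_nodes))
    random.Random(seed).shuffle(idx)
    n_train = int(0.48 * num_nodes)
    n_val = int(0.32 * num_nodes)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train, val, test = fixed_split(183)  # assumed node count for Texas
print(len(train), len(val), len(test))
```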
XLCoST is a benchmark dataset for cross-lingual code intelligence. The dataset contains fine-grained parallel data from 8 languages (7 commonly used programming languages and English), and supports 10 cross-language code tasks.
The dataset used for the NTIRE 2022 Spectral Recovery Challenge.
13 PAPERS • 1 BENCHMARK
A benchmark corpus developed to support the automatic extraction of drug-related adverse effects from medical case reports.
13 PAPERS • 3 BENCHMARKS
BioLAMA is a benchmark comprised of 49K biomedical factual knowledge triples for probing biomedical Language Models. It is used to assess the capabilities of Language Models for being valid biomedical knowledge bases.
CholecT45 is a subset of CholecT50 consisting of 45 videos from the Cholec80 dataset. It is the first public release of part of CholecT50 dataset. CholecT50 is a dataset of 50 endoscopic videos of laparoscopic cholecystectomy surgery introduced to enable research on fine-grained action recognition in laparoscopic surgery. It is annotated with 100 triplet classes in the form of <instrument, verb, target>.
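The ⟨instrument, verb, target⟩ triplet labels can be sketched as a mapping from class id to components. The class ids and component values below are hypothetical placeholders; the real CholecT50 release defines 100 such classes in its label maps:

```python
from typing import NamedTuple

class Triplet(NamedTuple):
    instrument: str
    verb: str
    target: str

# Hypothetical subset of the triplet-class mapping (illustration only):
TRIPLET_CLASSES = {
    0: Triplet("grasper", "retract", "gallbladder"),
    1: Triplet("hook", "dissect", "cystic_duct"),
}

def decode(class_id: int) -> str:
    """Render a triplet class id in <instrument, verb, target> form."""
    t = TRIPLET_CLASSES[class_id]
    return f"<{t.instrument}, {t.verb}, {t.target}>"

print(decode(1))
# → <hook, dissect, cystic_duct>
```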
13 PAPERS • 2 BENCHMARKS
This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation.
The General Robust Image Task (GRIT) Benchmark is an evaluation-only benchmark for evaluating the performance and robustness of vision systems across multiple image prediction tasks, concepts, and data sources. GRIT hopes to encourage our research community to pursue the following research directions:
13 PAPERS • 8 BENCHMARKS
GitTables is a corpus of currently 1M relational tables extracted from CSV files in GitHub covering 96 topics. Table columns in GitTables have been annotated with more than 2K different semantic types from Schema.org and DBpedia. The column annotations consist of semantic types, hierarchical relations, range types, table domain and descriptions.
13 PAPERS • NO BENCHMARKS YET
InterHuman is a multimodal two-person interaction dataset. It consists of about 107M frames of diverse two-person interactions, with accurate skeletal motions and 16,756 natural language descriptions.
Kleister NDA is a dataset for Key Information Extraction (KIE). The dataset contains a mix of scanned and born-digital long formal English-language documents. For this dataset, an NLP system is expected to find or infer various types of entities by employing both textual and structural layout features. The Kleister NDA dataset has 540 Non-disclosure Agreements, with 3,229 unique pages and 2,160 entities to extract.
Consists of annotated frames containing GI procedure tools such as snares, balloons, and biopsy forceps. Besides the images, the dataset includes ground-truth masks and bounding boxes, and has been verified by two expert GI endoscopists.
No-reference (NR) perceptual video quality assessment (VQA) is a complex, unsolved, and important problem to social and streaming media applications. Efficient and accurate video quality predictors are needed to monitor and guide the processing of billions of shared, often imperfect, user-generated content (UGC). Unfortunately, current NR models are limited in their prediction capabilities on real-world, "in-the-wild" UGC video data. To advance progress on this problem, we created the largest (by far) subjective video quality dataset, containing 39,000 real-world distorted videos and 117,000 space-time localized video patches ("v-patches"), and 5.5M human perceptual quality annotations. Using this, we created two unique NR-VQA models: (a) a local-to-global region-based NR VQA architecture (called PVQ) that learns to predict global video quality and achieves state-of-the-art performance on 3 UGC datasets, and (b) a first-of-a-kind space-time video quality mapping engine (called PVQ Ma
Maybe Ambiguous Pronoun is a dataset similar to the GAP dataset, but without binary gender constraints.
This is the Multi-Axis Temporal RElations for Start-points (MATRES) dataset.
MagicBrush is a manually annotated instruction-guided image editing dataset covering diverse scenarios: single-turn, multi-turn, mask-provided, and mask-free editing. MagicBrush comprises 10K (source image, instruction, target image) triples, which is sufficient to train large-scale image editing models.
The Multilingual Reuters Collection dataset comprises over 11,000 articles from six classes in five languages, i.e., English (E), French (F), German (G), Italian (I), and Spanish (S).
Vision-language modeling has enabled open-vocabulary tasks where predictions can be queried using any text prompt in a zero-shot manner. Existing open-vocabulary tasks focus on object classes, whereas research on object attributes is limited due to the lack of a reliable attribute-focused evaluation benchmark. This paper introduces the Open-Vocabulary Attribute Detection (OVAD) task and the corresponding OVAD benchmark. The objective of the novel task and benchmark is to probe object-level attribute information learned by vision-language models. To this end, we created a clean and densely annotated test set covering 117 attribute classes on the 80 object classes of MS COCO. It includes positive and negative annotations, which enables open-vocabulary evaluation. Overall, the benchmark consists of 1.4 million annotations. For reference, we provide a first baseline method for open-vocabulary attribute detection. Moreover, we demonstrate the benchmark's value by studying the attribute dete
Occ3D is a dataset for 3D occupancy prediction, which aims to estimate the detailed occupancy and semantics of objects from multi-view images. To facilitate this task, the dataset provides a label generation pipeline that produces dense, visibility-aware labels for a given scene. This pipeline includes point cloud aggregation, point labeling, and occlusion handling.
OpinionQA is a dataset for evaluating the alignment of LM opinions with those of 60 US demographic groups over topics ranging from abortion to automation.
Pile of Law is a ∼256GB (and growing) dataset of legal and administrative data which can be used for assessing norms on data sanitization across legal and administrative settings.
Most existing dialogue systems fail to respond properly to potentially unsafe user utterances by either ignoring or passively agreeing with them.
The goal of the Robust track is to improve the consistency of retrieval technology by focusing on poorly performing topics. In addition, the track brings back a classic, ad hoc retrieval task in TREC that provides a natural home for new participants. An ad hoc task in TREC investigates the performance of systems that search a static set of documents using previously-unseen topics. For each topic, participants create a query and submit a ranking of the top 1000 documents for that topic.
Stuttering Events in Podcasts (SEP-28k) is a dataset containing over 28k clips labeled with five event types including blocks, prolongations, sound repetitions, word repetitions, and interjections. Audio comes from public podcasts largely consisting of people who stutter interviewing other people who stutter.
SPGISpeech (pronounced “speegie-speech”) is a large-scale transcription dataset, freely available for academic research. SPGISpeech is a collection of 5,000 hours of professionally transcribed financial audio. In contrast to previous transcription datasets, SPGISpeech contains global English accents, strongly varying audio quality, and both spontaneous and presentation-style speech. The transcripts have each been cross-checked by multiple professional editors for high accuracy and are fully formatted, including sentence structure and capitalization.
Pseudocode-to-Code (SPoC) is a program synthesis dataset, containing 18,356 programs with human-authored pseudocode and test cases.
The T2Dv2 dataset consists of 779 tables originating from the English-language subset of the WebTables corpus. 237 tables are annotated for the Table Type Detection task, 236 for the Columns Property Annotation (CPA) task and 235 for the Row Annotation task. The annotations that are used are DBpedia types, properties and entities.
13 PAPERS • 4 BENCHMARKS
University-1652 contains data from three platforms, i.e., synthetic drones, satellites, and ground cameras, covering 1,652 university buildings around the world. It is a drone-based geo-localization dataset and enables two new tasks: drone-view target localization and drone navigation.
Who-did-What collects its corpus from news articles and provides answer options for questions, similar to CBT. Each question is formed from two independent articles: one article is treated as the context to be read, and a separate article on the same event is used to form the query.
xSID is a new evaluation benchmark for cross-lingual (X) Slot and Intent Detection in 13 languages from 6 language families, including a very low-resource dialect. It covers Arabic (ar), Chinese (zh), Danish (da), Dutch (nl), English (en), German (de), Indonesian (id), Italian (it), Japanese (ja), Kazakh (kk), Serbian (sr), Turkish (tr), and an Austro-Bavarian German dialect, South Tyrolean (de-st).
ACL Anthology Reference Corpus (ACL ARC) is a collection of 10,920 academic papers from the ACL Anthology. ACL ARC is cleaned to remove:
12 PAPERS • 4 BENCHMARKS
BUG is a large-scale gender bias dataset of 108K diverse real-world English sentences, sampled semi-automatically from large corpora using lexical-syntactic pattern matching.
12 PAPERS • NO BENCHMARKS YET
CICERO contains 53,000 inferences for five commonsense dimensions -- cause, subsequent event, prerequisite, motivation, and emotional reaction -- collected from 5,600 dialogues. It involves two challenging generative and multi-choice alternative selection tasks for state-of-the-art NLP models to solve.
COCO-MLT is created from MS COCO-2017 and contains 1,909 images from 80 classes. The maximum number of training images per class is 1,128 and the minimum is 6. We use the COCO-2017 test set, with 5,000 images, for evaluation. The ratio of head, medium, and tail classes is 22:33:25 in COCO-MLT.
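A head/medium/tail partition of this kind can be sketched by thresholding per-class training counts. The thresholds (100 and 20) and the example counts below are assumptions for illustration, not the paper's exact cutoffs:

```python
def partition_classes(train_counts: dict, head_min: int = 100, tail_max: int = 20):
    """Illustrative head/medium/tail split by per-class training count.
    The threshold values are assumptions, not COCO-MLT's exact cutoffs."""
    head = [c for c, n in train_counts.items() if n >= head_min]
    tail = [c for c, n in train_counts.items() if n <= tail_max]
    medium = [c for c, n in train_counts.items() if tail_max < n < head_min]
    return head, medium, tail

# Hypothetical per-class training counts:
counts = {"person": 1128, "toaster": 6, "bicycle": 64}
head, medium, tail = partition_classes(counts)
print(head, medium, tail)
# → ['person'] ['bicycle'] ['toaster']
```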
12 PAPERS • 2 BENCHMARKS
The Chaoyang dataset contains 1,111 normal, 842 serrated, 1,404 adenocarcinoma, and 664 adenoma samples for training, and 705 normal, 321 serrated, 840 adenocarcinoma, and 273 adenoma samples for testing. This noisy dataset was constructed in a real-world scenario.
We present a speech data corpus that simulates a "dinner party" scenario taking place in an everyday home environment. The corpus was created by recording multiple groups of four Amazon employee volunteers having a natural conversation in English around a dining table. The participants were recorded by a single-channel close-talk microphone and by five far-field 7-microphone array devices positioned at different locations in the recording room. The dataset contains the audio recordings and human-labeled transcripts of a total of 10 sessions with a duration between 15 and 45 minutes. The corpus was created to advance the field of noise-robust and distant speech processing and is intended to serve as a public research and benchmarking data set.
Dress Code is a new dataset for image-based virtual try-on composed of image pairs coming from different catalogs of YOOX NET-A-PORTER. The dataset contains more than 50k high-resolution model/clothing image pairs divided into three categories (i.e., dresses, upper-body clothes, and lower-body clothes).