We introduce a new dataset, called FoCus, which supports knowledge-grounded answers that reflect a user's persona. One situation in which people need different types of knowledge, depending on their preferences, is when they travel around the world.
14 PAPERS • NO BENCHMARKS YET
FoodSeg103 is a new food image dataset containing 7,118 images. Images are annotated with 104 ingredient classes, and each image has an average of 6 ingredient labels with pixel-wise masks. It is provided as a large-scale benchmark for food image segmentation.
14 PAPERS • 1 BENCHMARK
FreebaseQA is a data set for open-domain QA over the Freebase knowledge graph. The question-answer pairs in this data set are collected from various sources, including the TriviaQA data set and other trivia websites (QuizBalls, QuizZone, KnowQuiz), and are matched against Freebase to generate relevant subject-predicate-object triples that were further verified by human annotators. As all questions in FreebaseQA are composed independently for human contestants in various trivia-like competitions, this data set shows richer linguistic variation and complexity than existing QA data sets, making it a good test-bed for emerging KB-QA systems.
GeoS is a dataset for automatic math problem solving. It is a dataset of SAT plane geometry questions where every question has a textual description in English accompanied by a diagram and multiple choices. Questions and answers are compiled from previous official SAT exams and practice exams offered by the College Board. We annotate ground-truth logical forms for all questions in the dataset.
Humicroedit is a humorous headline dataset. The data consists of regular English news headlines paired with versions of the same headlines that contain simple replacement edits designed to make them funny. The authors carefully curated crowdsourced editors to create funny headlines and judges to score them, yielding a total of 15,095 edited headlines with five judges per headline.
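As an illustration, a replacement edit of this kind can be applied by swapping the marked span of the original headline for the edit word. The tagged `<word/>` span format and the example headline below are assumptions for the sketch, not necessarily the dataset's exact serialization:

```python
import re

def apply_edit(original: str, replacement: str) -> str:
    """Replace the tagged <word/> span in a headline with the edit word."""
    return re.sub(r"<[^/>]+/>", replacement, original, count=1)

# Hypothetical example headline and edit word:
print(apply_edit("Police <arrest/> man in connection with robbery", "tickle"))
# → Police tickle man in connection with robbery
```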
The Indian Diabetic Retinopathy Image Dataset (IDRiD) consists of typical diabetic retinopathy lesions and normal retinal structures annotated at a pixel level. The dataset also provides information on the disease severity of diabetic retinopathy and diabetic macular edema for each image, making it well suited to the development and evaluation of image analysis algorithms for early detection of diabetic retinopathy.
14 PAPERS • 3 BENCHMARKS
The LIVECell (Label-free In Vitro image Examples of Cells) dataset is a large-scale microscopic image dataset for instance-segmentation of individual cells in 2D cell cultures.
OpenAssistant Conversations is a human-feedback dataset with between 100K and 1M examples, released under the Apache-2.0 license.
The goal of PubTables-1M is to create a large, detailed, high-quality dataset for training and evaluating a wide variety of models for the tasks of table detection, table structure recognition, and functional analysis. It contains:
SQA3D is a dataset for embodied scene understanding, where an agent needs to comprehend the scene it is situated in from a first-person perspective and answer questions. The questions are designed to be situated, embodied, and knowledge-intensive. We offer three different modalities to represent a 3D scene: 3D scan, egocentric video, and BEV picture.
14 PAPERS • 2 BENCHMARKS
Capturing present knowledge from surrounding situations and reasoning over it accordingly is crucial and challenging for machine intelligence. The STAR Benchmark is a novel benchmark for Situated Reasoning, which provides 60K challenging situated questions across four types of tasks, 140K situated hypergraphs, symbolic situation programs, and logic-grounded diagnosis for real-world video situations.
Node classification on Texas with the fixed 48%/32%/20% train/validation/test splits provided by Geom-GCN.
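For illustration, a 48%/32%/20% node split can be sketched as below. This is an assumption-laden sketch: the actual Geom-GCN splits ship as fixed mask files, and the node count of 183 for Texas is assumed here:

```python
import random

def fixed_split(num_nodes: int, seed: int = 0):
    """Illustrative 48%/32%/20% train/val/test node split.
    The real Geom-GCN splits are distributed as fixed mask files."""
    idx = list(range(num_nodes))
    random.Random(seed).shuffle(idx)
    n_train = int(0.48 * num_nodes)
    n_val = int(0.32 * num_nodes)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train, val, test = fixed_split(183)  # assumed node count for Texas
print(len(train), len(val), len(test))
```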
XLCoST is a benchmark dataset for cross-lingual code intelligence. The dataset contains fine-grained parallel data from 8 languages (7 commonly used programming languages and English), and supports 10 cross-language code tasks.
The dataset used for the NTIRE 2022 Spectral Recovery Challenge.
13 PAPERS • 1 BENCHMARK
A benchmark corpus developed to support the automatic extraction of drug-related adverse effects from medical case reports.
13 PAPERS • 3 BENCHMARKS
BioLAMA is a benchmark comprised of 49K biomedical factual knowledge triples for probing biomedical Language Models. It is used to assess the capabilities of Language Models for being valid biomedical knowledge bases.
CholecT45 is a subset of CholecT50 consisting of 45 videos from the Cholec80 dataset. It is the first public release of part of CholecT50 dataset. CholecT50 is a dataset of 50 endoscopic videos of laparoscopic cholecystectomy surgery introduced to enable research on fine-grained action recognition in laparoscopic surgery. It is annotated with 100 triplet classes in the form of <instrument, verb, target>.
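The ⟨instrument, verb, target⟩ triplet labels can be sketched as a mapping from class id to components. The class ids and component values below are hypothetical placeholders; the real CholecT50 release defines 100 such classes in its label maps:

```python
from typing import NamedTuple

class Triplet(NamedTuple):
    instrument: str
    verb: str
    target: str

# Hypothetical subset of the triplet-class mapping (illustration only):
TRIPLET_CLASSES = {
    0: Triplet("grasper", "retract", "gallbladder"),
    1: Triplet("hook", "dissect", "cystic_duct"),
}

def decode(class_id: int) -> str:
    """Render a triplet class id in <instrument, verb, target> form."""
    t = TRIPLET_CLASSES[class_id]
    return f"<{t.instrument}, {t.verb}, {t.target}>"

print(decode(1))
# → <hook, dissect, cystic_duct>
```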
13 PAPERS • 2 BENCHMARKS
This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation.
The General Robust Image Task (GRIT) Benchmark is an evaluation-only benchmark for evaluating the performance and robustness of vision systems across multiple image prediction tasks, concepts, and data sources. GRIT hopes to encourage our research community to pursue the following research directions:
13 PAPERS • 8 BENCHMARKS
GitTables is a corpus of currently 1M relational tables extracted from CSV files in GitHub covering 96 topics. Table columns in GitTables have been annotated with more than 2K different semantic types from Schema.org and DBpedia. The column annotations consist of semantic types, hierarchical relations, range types, table domain and descriptions.
13 PAPERS • NO BENCHMARKS YET
InterHuman is a multimodal two-person interaction dataset. It consists of about 107M frames of diverse two-person interactions, with accurate skeletal motions and 16,756 natural language descriptions.
Kleister NDA is a dataset for Key Information Extraction (KIE). The dataset contains a mix of scanned and born-digital long formal English-language documents. For this dataset, an NLP system is expected to find or infer various types of entities by employing both textual and structural layout features. The Kleister NDA dataset has 540 Non-disclosure Agreements, with 3,229 unique pages and 2,160 entities to extract.
Consists of annotated frames containing GI procedure tools such as snares, balloons, and biopsy forceps. Besides the images, the dataset includes ground-truth masks and bounding boxes, and has been verified by two expert GI endoscopists.
No-reference (NR) perceptual video quality assessment (VQA) is a complex, unsolved, and important problem to social and streaming media applications. Efficient and accurate video quality predictors are needed to monitor and guide the processing of billions of shared, often imperfect, user-generated content (UGC). Unfortunately, current NR models are limited in their prediction capabilities on real-world, "in-the-wild" UGC video data. To advance progress on this problem, we created the largest (by far) subjective video quality dataset, containing 39,000 real-world distorted videos and 117,000 space-time localized video patches ("v-patches"), and 5.5M human perceptual quality annotations. Using this, we created two unique NR-VQA models: (a) a local-to-global region-based NR VQA architecture (called PVQ) that learns to predict global video quality and achieves state-of-the-art performance on 3 UGC datasets, and (b) a first-of-a-kind space-time video quality mapping engine (called PVQ Ma
Maybe Ambiguous Pronoun is a dataset similar to the GAP dataset, but without binary gender constraints.
This is the Multi-Axis Temporal RElations for Start-points (MATRES) dataset.
MagicBrush is a manually annotated instruction-guided image editing dataset covering diverse scenarios: single-turn, multi-turn, mask-provided, and mask-free editing. MagicBrush comprises 10K (source image, instruction, target image) triples, which is sufficient to train large-scale image editing models.
The Multilingual Reuters Collection dataset comprises over 11,000 articles from six classes in five languages, i.e., English (E), French (F), German (G), Italian (I), and Spanish (S).
Vision-language modeling has enabled open-vocabulary tasks where predictions can be queried using any text prompt in a zero-shot manner. Existing open-vocabulary tasks focus on object classes, whereas research on object attributes is limited due to the lack of a reliable attribute-focused evaluation benchmark. This paper introduces the Open-Vocabulary Attribute Detection (OVAD) task and the corresponding OVAD benchmark. The objective of the novel task and benchmark is to probe object-level attribute information learned by vision-language models. To this end, we created a clean and densely annotated test set covering 117 attribute classes on the 80 object classes of MS COCO. It includes positive and negative annotations, which enables open-vocabulary evaluation. Overall, the benchmark consists of 1.4 million annotations. For reference, we provide a first baseline method for open-vocabulary attribute detection. Moreover, we demonstrate the benchmark's value by studying the attribute dete
Occ3D is a dataset for 3D occupancy prediction, which aims to estimate the detailed occupancy and semantics of objects from multi-view images. To facilitate this task, the dataset provides a label generation pipeline that produces dense, visibility-aware labels for a given scene. This pipeline includes point cloud aggregation, point labeling, and occlusion handling.
OpinionQA is a dataset for evaluating the alignment of LM opinions with those of 60 US demographic groups over topics ranging from abortion to automation.
Pile of Law is a ∼256GB (and growing) dataset of legal and administrative data which can be used for assessing norms on data sanitization across legal and administrative settings.
Most existing dialogue systems fail to respond properly to potentially unsafe user utterances by either ignoring or passively agreeing with them.
The goal of the Robust track is to improve the consistency of retrieval technology by focusing on poorly performing topics. In addition, the track brings back a classic, ad hoc retrieval task in TREC that provides a natural home for new participants. An ad hoc task in TREC investigates the performance of systems that search a static set of documents using previously-unseen topics. For each topic, participants create a query and submit a ranking of the top 1000 documents for that topic.
Stuttering Events in Podcasts (SEP-28k) is a dataset containing over 28k clips labeled with five event types including blocks, prolongations, sound repetitions, word repetitions, and interjections. Audio comes from public podcasts largely consisting of people who stutter interviewing other people who stutter.
SPGISpeech (pronounced “speegie-speech”) is a large-scale transcription dataset, freely available for academic research. SPGISpeech is a collection of 5,000 hours of professionally transcribed financial audio. In contrast to previous transcription datasets, SPGISpeech contains global English accents, strongly varying audio quality, and both spontaneous and presentation-style speech. The transcripts have each been cross-checked by multiple professional editors for high accuracy and are fully formatted, including sentence structure and capitalization.
Pseudocode-to-Code (SPoC) is a program synthesis dataset, containing 18,356 programs with human-authored pseudocode and test cases.
The T2Dv2 dataset consists of 779 tables originating from the English-language subset of the WebTables corpus. 237 tables are annotated for the Table Type Detection task, 236 for the Columns Property Annotation (CPA) task and 235 for the Row Annotation task. The annotations that are used are DBpedia types, properties and entities.
13 PAPERS • 4 BENCHMARKS
University-1652 contains data from three platforms, i.e., synthetic drones, satellites, and ground cameras, covering 1,652 university buildings around the world. It is a drone-based geo-localization dataset and enables two new tasks: drone-view target localization and drone navigation.
Who-did-What collects its corpus from news articles and provides answer options for questions, similar to CBT. Each question is formed from two independent articles: one article is treated as the context to be read, and a separate article on the same event is used to form the query.
xSID is a new evaluation benchmark for cross-lingual (X) Slot and Intent Detection in 13 languages from 6 language families, including a very low-resource dialect. It covers Arabic (ar), Chinese (zh), Danish (da), Dutch (nl), English (en), German (de), Indonesian (id), Italian (it), Japanese (ja), Kazakh (kk), Serbian (sr), Turkish (tr), and an Austro-Bavarian German dialect, South Tyrolean (de-st).
ACL Anthology Reference Corpus (ACL ARC) is a collection of 10,920 academic papers from the ACL Anthology. ACL ARC is cleaned to remove:
12 PAPERS • 4 BENCHMARKS
BUG is a large-scale gender bias dataset of 108K diverse real-world English sentences, sampled semi-automatically from large corpora using lexical-syntactic pattern matching.
12 PAPERS • NO BENCHMARKS YET
CICERO contains 53,000 inferences for five commonsense dimensions -- cause, subsequent event, prerequisite, motivation, and emotional reaction -- collected from 5,600 dialogues. It involves two challenging generative and multi-choice alternative selection tasks for state-of-the-art NLP models to solve.
COCO-MLT is created from MS COCO-2017 and contains 1,909 images from 80 classes. The maximum number of training images per class is 1,128 and the minimum is 6. We use the COCO-2017 test set, with 5,000 images, for evaluation. The ratio of head, medium, and tail classes is 22:33:25 in COCO-MLT.
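A head/medium/tail partition of this kind can be sketched by thresholding per-class training counts. The thresholds (100 and 20) and the example counts below are assumptions for illustration, not the paper's exact cutoffs:

```python
def partition_classes(train_counts: dict, head_min: int = 100, tail_max: int = 20):
    """Illustrative head/medium/tail split by per-class training count.
    The threshold values are assumptions, not COCO-MLT's exact cutoffs."""
    head = [c for c, n in train_counts.items() if n >= head_min]
    tail = [c for c, n in train_counts.items() if n <= tail_max]
    medium = [c for c, n in train_counts.items() if tail_max < n < head_min]
    return head, medium, tail

# Hypothetical per-class training counts:
counts = {"person": 1128, "toaster": 6, "bicycle": 64}
head, medium, tail = partition_classes(counts)
print(head, medium, tail)
# → ['person'] ['bicycle'] ['toaster']
```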
12 PAPERS • 2 BENCHMARKS
The Chaoyang dataset contains 1,111 normal, 842 serrated, 1,404 adenocarcinoma, and 664 adenoma samples for training, and 705 normal, 321 serrated, 840 adenocarcinoma, and 273 adenoma samples for testing. This noisy dataset was constructed in a real-world scenario.
We present a speech data corpus that simulates a "dinner party" scenario taking place in an everyday home environment. The corpus was created by recording multiple groups of four Amazon employee volunteers having a natural conversation in English around a dining table. The participants were recorded by a single-channel close-talk microphone and by five far-field 7-microphone array devices positioned at different locations in the recording room. The dataset contains the audio recordings and human-labeled transcripts of a total of 10 sessions with a duration between 15 and 45 minutes. The corpus was created to advance the field of noise-robust and distant speech processing and is intended to serve as a public research and benchmarking data set.
Dress Code is a new dataset for image-based virtual try-on composed of image pairs coming from different catalogs of YOOX NET-A-PORTER. The dataset contains more than 50k high-resolution model/clothing image pairs divided into three categories (i.e., dresses, upper-body clothes, and lower-body clothes).