The dataset contains 2,624 samples of $300\times300$ pixels 8-bit grayscale images of functional and defective solar cells with varying degree of degradations extracted from 44 different solar modules. The defects in the annotated images are either of intrinsic or extrinsic type and are known to reduce the power efficiency of solar modules.
12 PAPERS • NO BENCHMARKS YET
The Endomapper dataset is the first collection of complete endoscopy sequences acquired during regular medical practice, including slow and careful screening explorations, making secondary use of medical data. Its original purpose is to facilitate the development and evaluation of VSLAM (Visual Simultaneous Localization and Mapping) methods in real endoscopy data. The first release of the dataset is composed of 50 sequences with a total of more than 13 hours of video. It is also the first endoscopic dataset that includes both the computed geometric and photometric endoscope calibration as well as the original calibration videos. Meta-data and annotations associated to the dataset varies from anatomical landmark and description of the procedure labeling, tools segmentation masks, COLMAP 3D reconstructions, simulated sequences with groundtruth and meta-data related to special cases, such as sequences from the same patient. This information will improve the research in endoscopic VSLAM, a
FaVIQ (Fact Verification from Information-seeking Questions) is a challenging and realistic fact verification dataset that reflects confusions raised by real users. We use the ambiguity in information-seeking questions and their disambiguation, and automatically convert them to true and false claims. These claims are natural, and require a complete understanding of the evidence for verification. FaVIQ serves as a challenging benchmark for natural language understanding, and improves performance in professional fact checking.
FaithDial is a new benchmark for hallucination-free dialogues, by editing hallucinated responses in the Wizard of Wikipedia (WoW) benchmark.
GLGE is a general language generation evaluation benchmark which is composed of 8 language generation tasks, including Abstractive Text Summarization (CNN/DailyMail, Gigaword, XSUM, MSNews), Answer-aware Question Generation (SQuAD 1.1, MSQG), Conversational Question Answering (CoQA), and Personalizing Dialogue (Personachat).
Over the past few years, systems have been developed to control online content and eliminate abusive, offensive or hate speech content. However, people in power sometimes misuse this form of censorship to obstruct the democratic right of freedom of speech. Therefore, it is imperative that research should take a positive reinforcement approach towards online content that is encouraging, positive and supportive contents. Until now, most studies have focused on solving this problem of negativity in the English language, though the problem is much more than just harmful content. Furthermore, it is multilingual as well. Thus, we have constructed a Hope Speech dataset for Equality, Diversity and Inclusion (HopeEDI) containing user-generated comments from the social media platform YouTube with 28,451, 20,198 and 10,705 comments in English, Tamil and Malayalam, respectively, manually labelled as containing hope speech or not. To our knowledge, this is the first research of its kind to annotate
12 PAPERS • 4 BENCHMARKS
The Hutter Prize Wikipedia dataset, also known as enwiki8, is a byte-level dataset consisting of the first 100 million bytes of a Wikipedia XML dump. For simplicity we shall refer to it as a character-level dataset. Within these 100 million bytes are 205 unique tokens.
12 PAPERS • 1 BENCHMARK
L3DAS22: MACHINE LEARNING FOR 3D AUDIO SIGNAL PROCESSING This dataset supports the L3DAS22 IEEE ICASSP Gand Challenge. The challenge is supported by a Python API that facilitates the dataset download and preprocessing, the training and evaluation of the baseline models and the results submission.
LegalBench is a fascinating project that revolves around legal reasoning and evaluation. Let me break it down for you:
Lila is a unified mathematical reasoning benchmark consisting of 23 diverse tasks along four dimensions: (i) mathematical abilities e.g., arithmetic, calculus (ii) language format e.g., question-answering, fill-in-the-blanks (iii) language diversity e.g., no language, simple language (iv) external knowledge e.g., commonsense, physics. The benchmark is constructed by extending 20 datasets benchmark by collecting task instructions and solutions in the form of Python programs, thereby obtaining explainable solutions in addition to the correct answer.
Multimodal C4 (MMC4) is an augmentation of the popular text-only c4 corpus with images interleaved. The corpus contains 103M documents containing 585M images interleaved with 43B English tokens.
Existing hate speech datasets contain only textual data. We create a new manually annotated multimodal hate speech dataset formed by 150,000 tweets, each one of them containing text and an image. We call the dataset MMHS150K.
The NewSHead dataset contains 369,940 English stories with 932,571 unique URLs, among which there are 359,940 stories for training, 5,000 for validation, and 5,000 for testing, respectively. Each news story contains at least three (and up to five) articles.
PARANMT-50M is a dataset for training paraphrastic sentence embeddings. It consists of more than 50 million English-English sentential paraphrase pairs.
A dataset of color images corrupted by natural noise due to low-light conditions, together with spatially and intensity-aligned low noise images of the same scenes.
Next generation task-oriented dialog systems need to understand conversational contexts with their perceived surroundings, to effectively help users in the real-world multimodal environment. Existing task-oriented dialog datasets aimed towards virtual assistance fall short and do not situate the dialog in the user's multimodal context. To overcome, we present a new dataset for Situated and Interactive Multimodal Conversations, SIMMC 2.0, which includes 11K task-oriented user<->assistant dialogs (117K utterances) in the shopping domain, grounded in immersive and photo-realistic scenes. The dialogs are collected using a two-phase pipeline: (1) A novel multimodal dialog simulator generates simulated dialog flows, with an emphasis on diversity and richness of interactions, (2) Manual paraphrasing of the generated utterances to collect diverse referring expressions. We provide an in-depth analysis of the collected dataset, and describe in detail the four main benchmark tasks we propose. Our
12 PAPERS • 2 BENCHMARKS
The SUN-SEG dataset is a high-quality per-frame annotated VPS dataset, which includes 158,690 frames from the famous SUN dataset. It extends the labels with diverse types, i.e., object mask, boundary, scribble, polygon, and visual attribute. It also introduces the pathological information from the original SUN dataset, including pathological classification labels, location information, and shape information.
We construct the long-tailed version of VOC from its 2012 train-val set. It contains 1,142 images from 20 classes, with a maximum of 775 images per class and a minimum of 4 images per class. The ratio of head, medium, and tail classes after splitting is 6:6:8. We evaluate the performance on VOC2007 test set with 4952 images.
VidSitu is a dataset for the task of semantic role labeling in videos (VidSRL). It is a large-scale video understanding data source with 29K 10-second movie clips richly annotated with a verb and semantic-roles every 2 seconds. Entities are co-referenced across events within a movie clip and events are connected to each other via event-event relations. Clips in VidSitu are drawn from a large collection of movies (∼3K) and have been chosen to be both complex (∼4.2 unique verbs within a video) as well as diverse (∼200 verbs have more than 100 annotations each).
speechocean762 is an open-source speech corpus designed for pronunciation assessment use, consisting of 5000 English utterances from 250 non-native speakers, where half of the speakers are children. Five experts annotated each of the utterances at sentence-level, word-level and phoneme-level. This corpus is allowed to be used freely for commercial and non-commercial purposes. To avoid subjective bias, each expert scores independently under the same metric
12 PAPERS • 3 BENCHMARKS
This dataset has 20 classes and each class has about 1000 documents. The data split for train/validation/test is 1600/200/200. We created 10 tasks, 2 classes per task. Since this is topic-based text classification data, the classes are very different and have little shared knowledge. As mentioned above, this application (and dataset) is mainly used to show a CL model's ability to overcome forgetting. Detailed statistics please on page https://github.com/ZixuanKe/PyContinual
11 PAPERS • 1 BENCHMARK
A set of 19 ASC datasets (reviews of 19 products) producing a sequence of 19 tasks. Each dataset represents a task. The datasets are from 4 sources: (1) HL5Domains (Hu and Liu, 2004) with reviews of 5 products; (2) Liu3Domains (Liu et al., 2015) with reviews of 3 products; (3) Ding9Domains (Ding et al., 2008) with reviews of 9 products; and (4) SemEval14 with reviews of 2 products - SemEval 2014 Task 4 for laptop and restaurant. For (1), (2) and (3), we split about 10% of the original data as the validate data, another about 10% of the original data as the testing data. For (4), We use 150 examples from the training set for validation. To be consistent with existing research(Tang et al., 2016), examples belonging to the conflicting polarity (both positive and negative sentiments are expressed about an aspect term) are not used. Statistics and details of the 19 datasets are given on Page https://github.com/ZixuanKe/PyContinual.
This paper introduces the Broad Twitter Corpus (BTC), which is not only significantly bigger, but sampled across different regions, temporal periods, and types of Twitter users. The gold-standard named entity annotations are made by a combination of NLP experts and crowd workers, which enables us to harness crowd recall while maintaining high quality. We also measure the entity drift observed in our dataset (i.e. how entity representation varies over time), and compare to newswire.
11 PAPERS • 2 BENCHMARKS
The Cornell eRulemaking Corpus – CDCP is an argument mining corpus annotated with argumentative structure information capturing the evaluability of arguments. The corpus consists of 731 user comments on Consumer Debt Collection Practices (CDCP) rule by the Consumer Financial Protection Bureau (CFPB); the resulting dataset contains 4931 elementary unit and 1221 support relation annotations. It is a resource for building argument mining systems that can not only extract arguments from unstructured text, but also identify what additional information is necessary for readers to understand and evaluate a given argument. Immediate applications include providing real-time feedback to commenters, specifying which types of support for which propositions can be added to construct better-formed arguments.
11 PAPERS • 3 BENCHMARKS
The BioCreative V CDR task corpus is manually annotated for chemicals, diseases and chemical-induced disease (CID) relations. It contains the titles and abstracts of 1500 PubMed articles and is split into equally sized train, validation and test sets. It is common to first tune a model on the validation set and then train on the combination of the train and validation sets before evaluating on the test set. It is also common to filter negative relations with disease entities that are hypernyms of a corresponding true relations disease entity within the same abstract (see Appendix C of this paper for details).
Along with COVID-19 pandemic we are also fighting an `infodemic'. Fake news and rumors are rampant on social media. Believing in rumors can cause significant harm. This is further exacerbated at the time of a pandemic. To tackle this, we curate and release a manually annotated dataset of 10,700 social media posts and articles of real and fake news on COVID-19. We benchmark the annotated dataset with four machine learning baselines - Decision Tree, Logistic Regression , Gradient Boost , and Support Vector Machine (SVM). We obtain the best performance of 93.46\% F1-score with SVM.
CrossNER is a cross-domain NER (Named Entity Recognition) dataset, a fully-labeled collection of NER data spanning over five diverse domains (Politics, Natural Science, Music, Literature, and Artificial Intelligence) with specialized entity categories for different domains. Additionally, CrossNER also includes unlabeled domain-related corpora for the corresponding five domains.
The DSTC7 Task 1 dataset is a dataset and task for goal-oriented dialogue. The data originates from human-human conversations, which is built from online resources, specifically the Ubuntu Internet Relay Chat (IRC) channel and an Advising dataset from the University of Michigan.
FoolMeTwice (FM2 for short) is a large dataset of challenging entailment pairs collected through a fun multi-player game. Gamification encourages adversarial examples, drastically lowering the number of examples that can be solved using "shortcuts" compared to other popular entailment datasets. Players are presented with two tasks. The first task asks the player to write a plausible claim based on the evidence from a Wikipedia page. The second one shows two plausible claims written by other players, one of which is false, and the goal is to identify it before the time runs out. Players "pay" to see clues retrieved from the evidence pool: the more evidence the player needs, the harder the claim. Game-play between motivated players leads to diverse strategies for crafting claims, such as temporal inference and diverting to unrelated evidence, and results in higher quality data for the entailment and evidence retrieval tasks.
GooAQ is a large-scale dataset with a variety of answer types. This dataset contains over 5 million questions and 3 million answers collected from Google. GooAQ questions are collected semi-automatically from the Google search engine using its autocomplete feature. This results in naturalistic questions of practical interest that are nonetheless short and expressed using simple language. GooAQ answers are mined from Google's responses to the collected questions, specifically from the answer boxes in the search results. This yields a rich space of answer types, containing both textual answers (short and long) as well as more structured ones such as collections.
11 PAPERS • NO BENCHMARKS YET
The Respiratory Sound database was originally compiled to support the scientific challenge organized at Int. Conf. on Biomedical Health Informatics - ICBHI 2017.
One large-scale database for Text-to-Image Person Re-identification, i.e., Text-based Person Retrieval.
This dataset evaluates instruction following ability of large language models. There are 500+ prompts with instructions such as "write an article with more than 800 words", "wrap your response with double quotation marks", etc.
Dataset Introduction
The N-ImageNet dataset is an event-camera counterpart for the ImageNet dataset. The dataset is obtained by moving an event camera around a monitor displaying images from ImageNet. N-ImageNet contains approximately 1,300k training samples and 50k validation samples. In addition, the dataset also contains variants of the validation dataset recorded under a wide range of lighting or camera trajectories. Additional details about the dataset are explained in the paper available through this link. Please cite this paper if you make use of the dataset.
Ohsumed includes medical abstracts from the MeSH categories of the year 1991. In [Joachims, 1997] were used the first 20,000 documents divided in 10,000 for training and 10,000 for testing. The specific task was to categorize the 23 cardiovascular diseases categories. After selecting the such category subset, the unique abstract number becomes 13,929 (6,286 for training and 7,643 for testing). As current computers can easily manage larger number of documents we make available all 34,389 cardiovascular diseases abstracts out of 50,216 medical abstracts contained in the year 1991.
OpenXAI is the first general-purpose lightweight library that provides a comprehensive list of functions to systematically evaluate the quality of explanations generated by attribute-based explanation methods. OpenXAI supports the development of new datasets (both synthetic and real-world) and explanation methods, with a strong bent towards promoting systematic, reproducible, and transparent evaluation of explanation methods.
P-DukeMTMC-reID is a modified version based on DukeMTMC-reID dataset. There are 12,927 images (665 identifies) in training set, 2,163 images (634 identities) for querying and 9,053 images in the gallery set.
PIE-Bench comprises 700 images featuring 10 distinct editing types. Images are evenly distributed in natural and artificial scenes (e.g., paintings) among four categories: animal, human, indoor, and outdoor. Each image in PIE-Bench includes five annotations: source image prompt, target image prompt, editing instruction, main editing body, and the editing mask. Notably, the editing mask annotation (indicating the anticipated editing region) is crucial in accurate metrics computations as we expect the editing to only occur within a designated area.
ParCorFull is a parallel corpus annotated with full coreference chains that has been created to address an important problem that machine translation and other multilingual natural language processing (NLP) technologies face -- translation of coreference across languages. This corpus contains parallel texts for the language pair English-German, two major European languages. Despite being typologically very close, these languages still have systemic differences in the realisation of coreference, and thus pose problems for multilingual coreference resolution and machine translation. This parallel corpus covers the genres of planned speech (public lectures) and newswire. It is richly annotated for coreference in both languages, including annotation of both nominal coreference and reference to antecedents expressed as clauses, sentences and verb phrases.
Project CodeNet is a large-scale dataset with approximately 14 million code samples, each of which is an intended solution to one of 4000 coding problems. The code samples are written in over 50 programming languages (although the dominant languages are C++, C, Python, and Java) and they are annotated with a rich set of information, such as its code size, memory footprint, cpu run time, and status, which indicates acceptance or error types. The dataset is accompanied by a repository, where we provide a set of tools to aggregate codes samples based on user criteria and to transform code samples into token sequences, simplified parse trees and other code graphs. A detailed discussion of Project CodeNet is available in this paper.
QAMPARI is an ODQA benchmark, where question answers are lists of entities, spread across many paragraphs. It was created by (a) generating questions with multiple answers from Wikipedia's knowledge graph and tables, (b) automatically pairing answers with supporting evidence in Wikipedia paragraphs, and (c) manually paraphrasing questions and validating each answer.
Tasks Our shared task has three subtasks. Subtask 1 and 2 focus on evaluating machine learning models' performance with regard to two definitions of abstractness (Spreen and Schulz, 1966; Changizi, 2008), which we call imperceptibility and nonspecificity, respectively. Subtask 3 aims to provide some insights to their relationships.
End-to-End Low Cost Compressive Spectral Imaging with Spatial-Spectral Self-Attention
The StarCraft Multi-Agent Challenges+ requires agents to learn completion of multi-stage tasks and usage of environmental factors without precise reward functions. The previous challenges (SMAC) recognized as a standard benchmark of Multi-Agent Reinforcement Learning are mainly concerned with ensuring that all agents cooperatively eliminate approaching adversaries only through fine manipulation with obvious reward functions. This challenge, on the other hand, is interested in the exploration capability of MARL algorithms to efficiently learn implicit multi-stage tasks and environmental factors as well as micro-control. This study covers both offensive and defensive scenarios. In the offensive scenarios, agents must learn to first find opponents and then eliminate them. The defensive scenarios require agents to use topographic features. For example, agents need to position themselves behind protective structures to make it harder for enemies to attack.
11 PAPERS • 13 BENCHMARKS
SODA is a high-quality social dialogue dataset. In contrast to most existing crowdsourced, small-scale dialogue corpora, Soda distills 1.5M socially-grounded dialogues from a pre-trained language model (InstructGPT; Ouyang et al., ). Dialogues are distilled by contextualizing social commonsense knowledge from a knowledge graph (Atomic10x).
StylePTB is a fine-grained text style transfer benchmark. It consists of paired sentences undergoing 21 fine-grained stylistic changes spanning atomic lexical, syntactic, semantic, and thematic transfers of text, as well as compositions of multiple transfers which allow modelling of fine-grained stylistic changes as building blocks for more complex, high-level transfers.