110 dataset results for Tabular AND English

Can you predict product backorder?

Problem Statement

1 PAPER • NO BENCHMARKS YET

Chicago Face Database (CFD)

"The Chicago Face Database was developed at the University of Chicago by Debbie S. Ma, Joshua Correll, and Bernd Wittenbrink. The CFD is intended for use in scientific research. It provides high-resolution, standardized photographs of male and female faces of varying ethnicity between the ages of 17-65. Extensive norming data are available for each individual model. These data include both physical attributes (e.g., face size) as well as subjective ratings by independent judges (e.g., attractiveness)."

1 PAPER • NO BENCHMARKS YET

Citations to invalid DOI-identified entities obtained from processing DOI-to-DOI citations to add in COCI

This dataset contains a two-column CSV file, where the first column ("Valid_citing_DOI") contains the DOI of a citing entity retrieved in Crossref, while the second column ("Invalid_cited_DOI") contains the invalid DOI of a cited entity identified by looking at the field "reference" in the JSON document returned by querying the Crossref API with the citing DOI.
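As a rough illustration of how a file with this layout might be consumed, the sketch below counts the invalid cited DOIs recorded per citing DOI. The two column names come from the description above; the sample rows, the DOIs in them, and the function name `count_invalid_citations` are hypothetical and stand in for the real file.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; these DOIs are made up for illustration only.
SAMPLE = (
    '"Valid_citing_DOI","Invalid_cited_DOI"\n'
    '"10.1000/example.citing.1","10.1000/broken-doi-a"\n'
    '"10.1000/example.citing.1","10.1000/broken-doi-b"\n'
    '"10.1000/example.citing.2","10.1000/broken-doi-a"\n'
)


def count_invalid_citations(handle):
    """Count the invalid cited DOIs listed for each citing DOI."""
    counts = Counter()
    # DictReader uses the header row, so columns are addressed by name.
    for row in csv.DictReader(handle):
        counts[row["Valid_citing_DOI"]] += 1
    return counts


counts = count_invalid_citations(io.StringIO(SAMPLE))
for citing_doi, n_invalid in counts.most_common():
    print(citing_doi, n_invalid)
```

To run this against the actual CSV, the `io.StringIO(SAMPLE)` argument would be swapped for an open file handle (opened with `newline=""`, as the `csv` module recommends).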

1 PAPER • NO BENCHMARKS YET

Concerns and Value Judgments of Stakeholders in the Non-Fungible Tokens (NFTs) Market

Concerns and Value Judgments of Stakeholders in the Non-Fungible Tokens (NFTs) Market (Replication Data for: "Centralized or Decentralized?")

1 PAPER • NO BENCHMARKS YET

The dataset contains 30 million cryptocurrency-related tweets from 10.10.2020 to 3.3.2021. See https://github.com/meakbiyik/ask-who-not-what for more details.

1 PAPER • NO BENCHMARKS YET

DBFC Dataset (Single Direct Borohydride Fuel Cell Dataset)

This dataset contains impedance and polarization test results for a Direct Borohydride Fuel Cell (DBFC) whose anode uses Pd/C, Pt/C, or Pd-decorated Ni–Co/rGO catalysts. The data cover different concentrations of sodium borohydride (SBH), applied voltages, and anode catalyst loadings, together with the experimental details of the electrochemical analysis. Voltage, power density, and resistance of the DBFC change as a function of SBH weight percent (%), applied voltage, and anode catalyst loading; these are evaluated from the polarization and impedance curves using an appropriate equivalent circuit of the fuel cell. The data therefore allow changes in the cell's electrochemical behavior to be interpreted, which can be useful for simulation, power source investigation, and in-depth analysis in DBFC research.

1 PAPER • NO BENCHMARKS YET

DEAP City Dataset

Main dataset: city_pollution_data.csv

1 PAPER • NO BENCHMARKS YET

Dataset of Paper Corpus

Overview of the scoping review paper corpus, sorted by their different intent types, categories, and subcategories. Note: Papers (77) may include multiple unique intents (172) and can therefore appear in multiple categories and subcategories.

1 PAPER • NO BENCHMARKS YET

Dataset of a Study of Computational reproducibility of Jupyter notebooks from biomedical publications

The dataset is generated from a study of the computational reproducibility of Jupyter notebooks from biomedical publications. We analyzed the reproducibility of Jupyter notebooks from GitHub repositories associated with publications indexed in the biomedical literature repository PubMed Central. The dataset includes the metadata of the journals, the publications, the GitHub repositories mentioned in the publications, and the notebooks present in those repositories.

1 PAPER • NO BENCHMARKS YET

Dataset of a Study of Computational reproducibility of Jupyter notebooks from biomedical publications version 1

Dataset of a Study of Computational reproducibility of Jupyter notebooks from biomedical publications version 1 (Version 1)

This repository contains the dataset for the study of the computational reproducibility of Jupyter notebooks from biomedical publications. We analyzed the reproducibility of Jupyter notebooks from GitHub repositories associated with publications indexed in the biomedical literature repository PubMed Central. The dataset includes the metadata of the journals, the publications, the GitHub repositories mentioned in the publications, and the notebooks present in those repositories.

1 PAPER • NO BENCHMARKS YET

DocBank-TB

DocBank-TB (DocBank-Table)

This dataset consists of 500 sets, each comprising a caption, a table, and the corresponding paper page, processed from DocBank.

1 PAPER • NO BENCHMARKS YET

Drosophila Immunity Time-Course Data

The data used for all results in this paper can be found here. This directory contains:

1 PAPER • NO BENCHMARKS YET

EUCA dataset

EUCA dataset description. Associated paper: "EUCA: the End-User-Centered Explainable AI Framework"

1 PAPER • NO BENCHMARKS YET

EUEN17037_Daylight_and_View_Standard_TestDataSet

EUEN17037 Daylight and View Standard Test Dataset.

1 PAPER • NO BENCHMARKS YET

EVI

The EVI dataset is a challenging, multilingual spoken-dialogue dataset with 5,506 dialogues in English, Polish, and French. The dataset can be used to develop and benchmark conversational systems for user authentication tasks, i.e. speaker enrolment (E), speaker verification (V), speaker identification (I).

1 PAPER • 3 BENCHMARKS

Error Grids for multi-fidelity benchmark functions in mf2

Provide:

1 PAPER • NO BENCHMARKS YET

EyeInfo

The EyeInfo Dataset is an open-source eye-tracking dataset created by Fabricio Batista Narcizo, a research scientist at the IT University of Copenhagen (ITU) and GN Audio A/S (Jabra), Denmark. This dataset was introduced in the paper "High-Accuracy Gaze Estimation for Interpolation-Based Eye-Tracking Methods" (DOI: 10.3390/vision5030041). The dataset contains high-speed monocular eye-tracking data from an off-the-shelf remote eye tracker using active illumination. The data from each user has a text file with data annotations of eye features, environment, viewed targets, and facial features. This dataset follows the principles of the General Data Protection Regulation (GDPR).

1 PAPER • NO BENCHMARKS YET

FICS PCB Image Collection (FPIC)

Optical images of printed circuit boards as well as detailed annotations of any text, logos, and surface-mount devices (SMDs). There are several hundred samples spanning a wide variety of manufacturing locations, sizes, node technology, applications, and more.

1 PAPER • NO BENCHMARKS YET

FinBench

FinBench is a benchmark for evaluating the performance of machine learning models with both tabular data inputs and profile text inputs.

1 PAPER • NO BENCHMARKS YET

GRD-TRT-BUF-4I Technical Validation Data

This is the static test data from the study "Global Geolocated Realtime Data of Interfleet Urban Transit Bus Idling" collected by GRD-TRT-BUF-4I. test-data-a.csv was collected from December 31, 2023 00:01:30 UTC to January 1, 2024 00:01:30 UTC. test-data-b.csv was collected from January 4, 2024 01:30:30 UTC to January 5, 2024 01:30:30 UTC. test-data-c.csv was collected from January 10, 2024 16:05:30 UTC to January 11, 2024 16:05:30 UTC.
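The three collection windows listed above can be checked with a little date arithmetic. The sketch below parses the stated start and end times as UTC and computes each window's length; the file names and timestamps are taken from the description, while the names `WINDOWS`, `parse_utc`, and `durations` are illustrative only. It assumes the default English month names for `%B` parsing.

```python
from datetime import datetime, timedelta, timezone

# Timestamp layout used in the description, e.g. "January 4, 2024 01:30:30".
FMT = "%B %d, %Y %H:%M:%S"

# File name -> (start, end), copied from the dataset description.
WINDOWS = {
    "test-data-a.csv": ("December 31, 2023 00:01:30", "January 1, 2024 00:01:30"),
    "test-data-b.csv": ("January 4, 2024 01:30:30", "January 5, 2024 01:30:30"),
    "test-data-c.csv": ("January 10, 2024 16:05:30", "January 11, 2024 16:05:30"),
}


def parse_utc(text):
    """Parse a timestamp in the description's layout and mark it as UTC."""
    return datetime.strptime(text, FMT).replace(tzinfo=timezone.utc)


# Length of each collection window as a timedelta.
durations = {
    name: parse_utc(end) - parse_utc(start)
    for name, (start, end) in WINDOWS.items()
}

for name, length in durations.items():
    print(name, length, length == timedelta(hours=24))
```

Each window works out to exactly 24 hours, so the three files are directly comparable as one-day samples.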

1 PAPER • NO BENCHMARKS YET

Harmonized US National Health and Nutrition Examination Survey (NHANES) 1988-2018

The National Health and Nutrition Examination Survey (NHANES) provides data on the health and environmental exposure of the non-institutionalized US population. Such data have considerable potential to understand how the environment and behaviors impact human health. These data are also currently leveraged to answer public health questions such as prevalence of disease. However, these data need to first be processed before new insights can be derived through large-scale analyses. NHANES data are stored across hundreds of files with multiple inconsistencies. Correcting such inconsistencies takes systematic cross examination and considerable efforts but is required for accurately and reproducibly characterizing the associations between the exposome and diseases (e.g., cancer mortality outcomes). Thus, we developed a set of curated and unified datasets and accompanied code by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous (1999-20

1 PAPER • NO BENCHMARKS YET

Heteroatom Doped Graphene Supercapacitor

Feature data for heteroatom-doped graphene supercapacitors, gathered from the literature for use in machine learning tasks. The main motivation is to optimize supercapacitors and to gain insight into models for electrochemistry tasks.

1 PAPER • NO BENCHMARKS YET

ICLR Database (ICLR Database (with Textual Covariates))

A maintained database tracks ICLR submissions and reviews, augmented with author profiles and higher-level textual features.

1 PAPER • NO BENCHMARKS YET

IEIs

IEIs (Ion and Electron Insulators)

We introduce three types of ion and electron insulators, i.e. Li-ion & electron insulators (LEIs), Na-ion & electron insulators (NEIs), and K-ion & electron insulators (KEIs), and provide code to screen candidate materials from the Materials Project computational materials database. The IEI materials can block the transport of multiple charge carriers (ions and electrons) and remain thermodynamically stable against specific alkali metals. The screening workflows and the use of IEI materials in rechargeable solid-state Li/Na/K metal batteries are presented in the paper below.

1 PAPER • NO BENCHMARKS YET

Knowledge Graph Maturity Model

1 PAPER • NO BENCHMARKS YET

LinkedResults

The LinkedResults dataset contains around 1,600 results capturing performance of machine learning models from tables of 239 papers. All tables come from a subset of SegmentedTables dataset. Each result is a tuple of form (task, dataset, metric name, metric value) and is linked to a particular table, row and cell it originates from.
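One way to picture the record structure described above is as a small typed container. The sketch below is only an illustration: the class names `CellRef` and `LinkedResult`, the field names, and the example values are hypothetical and do not reflect the dataset's actual file format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellRef:
    """Hypothetical pointer to the table cell a result was extracted from."""
    table_id: str
    row: int
    column: int


@dataclass(frozen=True)
class LinkedResult:
    """A (task, dataset, metric name, metric value) tuple plus its source cell."""
    task: str
    dataset: str
    metric_name: str
    metric_value: float
    source: CellRef

    def as_tuple(self):
        # The four-field tuple form mentioned in the dataset description.
        return (self.task, self.dataset, self.metric_name, self.metric_value)


# Made-up example values, for illustration only.
example = LinkedResult(
    task="Image Classification",
    dataset="ExampleSet",
    metric_name="Accuracy",
    metric_value=91.2,
    source=CellRef(table_id="paper-001/table-02", row=3, column=4),
)
print(example.as_tuple())
```

Keeping the source cell as a separate frozen object makes it easy to group results by table or to trace any reported value back to its origin.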

1 PAPER • NO BENCHMARKS YET

List of OWL reasoners

CSV file with a list of all examined OWL reasoners. For each item, information on usability and maintenance status, project pages, source code repositories and related documentation was gathered.

1 PAPER • NO BENCHMARKS YET

MIMI dataset

MIMI dataset (Multi-aspect Integrated Migration Indicators dataset)

Nowadays, new branches of research are proposing the use of non-traditional data sources for the study of migration trends in order to find an original methodology to answer open questions about cross-border human mobility. The Multi-aspect Integrated Migration Indicators (MIMI) dataset is a new dataset to be exploited in migration studies as a concrete example of this new approach. It includes both official data about bidirectional human migration (traditional flow and stock data) with multidisciplinary variables and original indicators, including economic, demographic, cultural and geographic indicators, together with the Facebook Social Connectedness Index (SCI). It results from the process of gathering, embedding and integrating traditional and novel variables, resulting in this new multidisciplinary dataset that could significantly contribute to nowcast/forecast bilateral migration trends and migration drivers.

1 PAPER • NO BENCHMARKS YET

MPOSE2021 (MPOSE2021 Dataset for Short-time Human Action Recognition)

MPOSE2021 is a dataset for real-time, short-time human action recognition (HAR), suitable for both pose-based and RGB-based methodologies. It includes 15,429 sequences from 100 actors and different scenarios, with limited frames per scene (between 20 and 30). In contrast to other publicly available datasets, the constrained number of time steps stimulates the development of real-time methodologies that perform HAR with low latency and high throughput.

1 PAPER • NO BENCHMARKS YET

Multi-Labelled SMILES Odors dataset

This dataset is a multi-labelled SMILES odor dataset with 138 odor descriptors. This dataset was created for replicating the paper: A principal odor map unifies diverse tasks in olfactory perception.

1 PAPER • 1 BENCHMARK

Multicenter dataset of neuroimaging features (part I)

A detailed description of this dataset can be found in the Zenodo repository: https://zenodo.org/record/7845311#.ZK-jty9BxhE

1 PAPER • NO BENCHMARKS YET

Multicenter dataset of neuroimaging features (part II)

A detailed description of this dataset can be found in the Zenodo repository: https://zenodo.org/record/7845361#.ZK-k7y9BxhE

1 PAPER • NO BENCHMARKS YET

Multicenter dataset of simulated neuroimaging features - quadratic relationship with age

A detailed description of this dataset can be found in the Zenodo repository: https://zenodo.org/record/8119042#.ZK-jJC9BxhE

1 PAPER • NO BENCHMARKS YET

PEM Fuel Cell Dataset (Proton Exchange Membrane (PEM) Fuel Cell Dataset)

This dataset covers standard tests of a Nafion 112 membrane and MEA activation tests of a PEM fuel cell under various operating conditions. It includes two general electrochemical analysis methods: polarization and impedance curves. The effects of different H2/O2 gas pressures, different voltages, and various humidity conditions are considered over several steps. The behavior of the PEM fuel cell during the distinct operating-condition tests, during the activation procedure, and under different operating conditions before and after activation can be inferred from the data. In the polarization curves, voltage and power density change as a function of H2/O2 flow and relative humidity. The resistances of the equivalent circuit used for the fuel cell can be calculated from the impedance data. The experimental response of the cell is thus captured in the presented data, which are useful for in-depth analysis, simulation, and material performance investigation in PEM fuel cell research.

1 PAPER • NO BENCHMARKS YET

Participatory Budgeting Preferences Data Set

The data set includes information about 120+ elections (configuration settings and descriptive statistics), projects and 125k+ anonymized voters and their budget preferences. Preferences were solicited with different elicitation methods (K-approval, knapsack, K-ranking and K-token). For some elections, voters also provided preferences under a secondary elicitation method, resulting in vote pairs from the same voter on the same budgeting question but with a different elicitation method.

1 PAPER • NO BENCHMARKS YET

Poisoned Water Detection using Smartphone embedded WiFi CSI data and Machine Learning Algorithms

Poisoned Water Detection using Smartphone embedded WiFi CSI data and Machine Learning Algorithms (Dataset and machine learning algorithms to detect poisoned water from clean water via using Smartphone embedded Wi-Fi CSI data.)

This repository contains a dataset and machine learning algorithms for distinguishing poisoned water from clean water using smartphone-embedded Wi-Fi CSI data.

1 PAPER • NO BENCHMARKS YET

Pylon Benchmark

Pylon Benchmark (Pylon Table Union Search Benchmark)

We create a new dataset from GitTables, a data lake of 1.7M tables extracted from CSV files on GitHub. The benchmark comprises 1,746 tables including union-able table subsets under topics selected from Schema.org: scholarly article, job posting, and music playlist. We settled on these three topics because the corpus contains a fair number of union-able tables for them from diverse sources. Union-able tables from a single source are easy to find, but they are less interesting for table union search: because they share a schema and consistent value representations, simple syntactic methods can identify all of them.

1 PAPER • NO BENCHMARKS YET

RGZ EMU: Semantic Taxonomy

RGZ EMU: Semantic Taxonomy (Radio Galaxy Zoo EMU: Towards a Semantic Radio Galaxy Morphology Taxonomy)

The data used in "Radio Galaxy Zoo EMU: Towards a Semantic Radio Galaxy Morphology Taxonomy" (Bowles et al. submitted) and in "A New Task: Deriving Semantic Class Targets for the Physical Sciences" (Bowles et al. 2022: https://arxiv.org/abs/2210.14760), accepted at the Fifth Workshop on Machine Learning and the Physical Sciences, Neural Information Processing Systems 2022.

1 PAPER • NO BENCHMARKS YET

Reflective essays on CS TA experience

Teaching assistants (TAs) are heavily used in computer science courses as a way to handle high enrollment while still being able to offer students individual tutoring and detailed assessments. This data is the result of a multi-institutional, multi-national study of the challenges that TAs in computer science face. 180 reflective essays written by TAs from three institutions across Europe were analyzed and coded. The thematic analysis resulted in five main challenges: becoming a professional TA, student-focused challenges, assessment, defining and using best practice, and threats to best practice. In addition, these challenges were all identified within the essays from all three institutions, indicating that the identified challenges are not particularly context-dependent. (2021-04-11)

1 PAPER • NO BENCHMARKS YET

Regensburg Pediatric Appendicitis Dataset

This dataset was acquired in a retrospective study from a cohort of pediatric patients admitted with abdominal pain to Children’s Hospital St. Hedwig in Regensburg, Germany. Multiple abdominal B-mode ultrasound images were acquired for most patients, with the number of views varying from 1 to 15. The images depict various regions of interest, such as the abdomen’s right lower quadrant, appendix, intestines, lymph nodes and reproductive organs. Alongside multiple US images for each subject, the dataset includes information encompassing laboratory tests, physical examination results, clinical scores, such as Alvarado and pediatric appendicitis scores, and expert-produced ultrasonographic findings. Lastly, the subjects were labeled w.r.t. three target variables: diagnosis (appendicitis vs. no appendicitis), management (surgical vs. conservative) and severity (complicated vs. uncomplicated or no appendicitis). The study was approved by the Ethics Committee of the University of Regensburg (

1 PAPER • NO BENCHMARKS YET

Replication Data for: "Deciphering Bitcoin Blockchain Data by Cohort Analysis" Version 3.1

Bitcoin is a peer-to-peer electronic payment system that popularized rapidly in recent years. Usually, we need to query the complete history of bitcoin blockchain data to acquire variables of economic meaning. This becomes increasingly difficult now with over 1.6 billion historical transactions on the Bitcoin blockchain. It is thus important to query Bitcoin transaction data in a way that is more efficient and provides economic insights. We apply cohort analysis that interprets bitcoin blockchain data using methods developed for population data in social science. Specifically, we query and process the Bitcoin transaction input and output data within each daily cohort. With this, we then create datasets and visualizations for some key indicators of bitcoin transactions, including the daily lifespan distributions of accumulated spent transaction output (STXO) and the daily age distributions of accumulated unspent transaction output (UTXO). We provide a computationally feasible approach t

1 PAPER • NO BENCHMARKS YET

RotoEdit

Fact-based Text Editing dataset based on RotoWire dataset

1 PAPER • 1 BENCHMARK

Social Network Study

The SNS data (Valente et al., 2013) is a four-wave survey conducted in Los Angeles County, United States, which features a sample of 1,795 high-school students. The survey collected information about high-school students in grades 10 to 12, a majority of whom self-identified as Hispanic. The collected information includes socio-economic status, demographics, social networks, and substance use (consumption of alcohol, tobacco, and marijuana).

1 PAPER • NO BENCHMARKS YET

Supplementary Material

Supplementary Material (Annotation Table of Review)

The file contains an annotated list of papers that are included in the literature survey.

1 PAPER • NO BENCHMARKS YET

SupplyGraph (SupplyGraph: A Benchmark Dataset for Supply Chain Planning using Graph Neural Networks)

Graph Neural Networks (GNNs) have gained traction across different domains such as transportation, bio-informatics, language processing, and computer vision. However, there is a noticeable absence of research on applying GNNs to supply chain networks. Supply chain networks are inherently graphlike in structure, making them prime candidates for applying GNN methodologies. This opens up a world of possibilities for optimizing, predicting, and solving even the most complex supply chain problems. A major setback in this approach lies in the absence of real-world benchmark datasets to facilitate the research and resolution of supply chain problem using GNNs. To address the issue, we present a real-world benchmark dataset for temporal tasks, obtained from one of the leading FMCG companies in Bangladesh, focusing on supply chain planning for production purposes. The dataset includes temporal data as node features to enable sales predictions, production planning, and the identification of fact

1 PAPER • NO BENCHMARKS YET

Sustainable Venture Capital Survey 2022

To explore the nascent area of sustainable venture capital, a review of related research was conducted and social entrepreneurs & investors interviewed to construct a questionnaire assessing the interests and intentions of current & future ecosystem participants. Analysis of 114 responses received via several sampling methods revealed statistically significant relationships between investing preferences and genders, generations, sophistication, and other variables, all the way down to the level of individual UN Sustainable Development Goals (SDGs).

1 PAPER • NO BENCHMARKS YET

The Reddit Climate Change Dataset

The Reddit Climate Change Dataset is a dataset of 620K Reddit posts and 4.6M comments - all mentions of the terms "climate" and "change" until 2022-09-01 across the entire Reddit social network. Both were procured with SocialGrep's export feature and released as part of SocialGrep Reddit datasets. The posts are labeled with their subreddit, title, creation date, domain, selftext, and score. The comments are labeled with their subreddit, body, creation date, sentiment (calculated for you using a VADER pipeline), and score.

1 PAPER • NO BENCHMARKS YET

Undecided Voters in US Presidential Elections

This data contains state-level election polls for the 2004, 2008, 2012, and 2016 US presidential elections, including data on undecided voter proportions.

1 PAPER • NO BENCHMARKS YET
