Our SRSD (Feynman) datasets are designed to evaluate the performance of Symbolic Regression for Scientific Discovery (SRSD). We carefully reviewed the properties of each formula and its variables in the Feynman Symbolic Regression Database to design reasonably realistic sampling ranges of values, so that our SRSD datasets can be used to evaluate the potential of SRSD, e.g., whether or not an SR method can (re)discover physical laws from such datasets.
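A minimal sketch of the kind of evaluation this enables is shown below: sample variable values from physically plausible (here, log-uniform) ranges, compute the target of a known formula, and ask an SR method to recover it. The formula choice, the ranges, and the regressor interface are illustrative assumptions, not the SRSD specification.

```python
import numpy as np

# Hypothetical example: Coulomb's law F = q1*q2 / (4*pi*eps0*r^2), sampled from
# log-uniform ranges intended to mimic physically plausible magnitudes.
rng = np.random.default_rng(0)

def log_uniform(low, high, size):
    return np.exp(rng.uniform(np.log(low), np.log(high), size))

n = 10_000
q1 = log_uniform(1e-9, 1e-6, n)   # charge [C]
q2 = log_uniform(1e-9, 1e-6, n)   # charge [C]
r = log_uniform(1e-3, 1e0, n)     # distance [m]
eps0 = 8.8541878128e-12
F = q1 * q2 / (4 * np.pi * eps0 * r**2)

X, y = np.column_stack([q1, q2, r]), F

# An SR method would then be asked to (re)discover the law from (X, y), e.g.:
# model = SomeSymbolicRegressor().fit(X, y)   # placeholder interface
# print(model.expression_)                    # compare against the ground-truth formula
```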
The SNS data (Valente et al., 2013) is a four-wave survey conducted in Los Angeles County, United States, featuring a sample of 1,795 high-school students. The survey collected information about students in grades 10 to 12, a majority of whom self-identified as Hispanic. The collected information includes socio-economic status, demographics, social networks, and substance use (consumption of alcohol, tobacco, and marijuana).
The file contains an annotated list of papers that are included in the literature survey.
Graph Neural Networks (GNNs) have gained traction across domains such as transportation, bioinformatics, language processing, and computer vision. However, there is a noticeable absence of research applying GNNs to supply chain networks. Supply chain networks are inherently graph-like in structure, making them prime candidates for GNN methodologies and opening up possibilities for optimizing, predicting, and solving even the most complex supply chain problems. A major obstacle is the absence of real-world benchmark datasets to facilitate research on supply chain problems using GNNs. To address this, we present a real-world benchmark dataset for temporal tasks, obtained from one of the leading FMCG companies in Bangladesh, focusing on supply chain planning for production purposes. The dataset includes temporal data as node features to enable sales prediction, production planning, and the identification of fact
Please see the paper for the questions. These are the answers to the surveys, processed and included in the paper via knitr.
To explore the nascent area of sustainable venture capital, a review of related research was conducted and social entrepreneurs and investors were interviewed to construct a questionnaire assessing the interests and intentions of current and future ecosystem participants. Analysis of 114 responses received via several sampling methods revealed statistically significant relationships between investing preferences and gender, generation, sophistication, and other variables, down to the level of individual UN Sustainable Development Goals (SDGs).
The ARPA-E-funded TERRA-REF project is generating open-access reference datasets for the study of plant sensing, genomics, and phenomics. Sensor data were generated by a field-scanner sensing platform that captures color, thermal, hyperspectral, and active fluorescence imagery, as well as three-dimensional structure and associated environmental measurements. This dataset is provided alongside data collected using traditional field methods in order to support calibration and validation of algorithms used to extract plot-level phenotypes from these datasets.
The Reddit Climate Change Dataset is a dataset of 620K Reddit posts and 4.6M comments - all mentions of the terms "climate" and "change" until 2022-09-01 across the entire Reddit social network. Both were procured with SocialGrep's export feature and released as part of SocialGrep Reddit datasets. The posts are labeled with their subreddit, title, creation date, domain, selftext, and score. The comments are labeled with their subreddit, body, creation date, sentiment (calculated for you using a VADER pipeline), and score.
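The exact sentiment pipeline is not documented here, but a minimal sketch of a VADER pass over comment bodies, using the vaderSentiment package, might look like the following; the example comments are invented for illustration.

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def compound_sentiment(text):
    # VADER returns a "compound" polarity score in [-1, 1];
    # the dataset's exact post-processing of this score is assumed, not documented.
    return analyzer.polarity_scores(text)["compound"]

comments = [
    "Climate change is a serious problem and we need to act now.",
    "This thread is full of misinformation.",
]
for c in comments:
    print(c, "->", compound_sentiment(c))
```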
AI-based digital twins are at the leading edge of the Industry 4.0 revolution, technologically empowered by the Internet of Things and real-time data analysis. Information collected from industrial assets is produced in a continuous fashion, yielding data streams that must be processed under stringent timing constraints. Such data streams are usually subject to non-stationary phenomena, which can change the data distribution of the streams; the knowledge captured by models used for data analysis may thus become obsolete (the so-called concept drift effect). Early detection of the change (drift) is crucial for updating the model's knowledge, which is especially challenging in scenarios where the ground truth associated with the stream data is not readily available. Among many other techniques, the estimation of the model's confidence has been timidly suggested in a few studies as a criterion for detecting drifts in unsupervised settings. The goal of this m
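A minimal sketch of confidence-based unsupervised drift detection of the kind alluded to above: track the model's mean predicted confidence over a sliding window and flag drift when it drops well below a calibrated reference. The window size, tolerance, and detector interface are illustrative assumptions, not the method from any particular study.

```python
import numpy as np
from collections import deque

class ConfidenceDriftDetector:
    """Flags drift when recent mean confidence drops well below a reference level.
    Window size and tolerance are illustrative choices."""

    def __init__(self, window=200, tolerance=0.10):
        self.window = deque(maxlen=window)
        self.tolerance = tolerance
        self.reference = None

    def update(self, confidence):
        self.window.append(confidence)
        if len(self.window) < self.window.maxlen:
            return False                      # not enough samples yet
        mean_conf = np.mean(self.window)
        if self.reference is None:
            self.reference = mean_conf        # calibrate on the first full window
            return False
        return mean_conf < self.reference - self.tolerance

# Usage with any probabilistic classifier: feed the max class probability per sample.
# detector = ConfidenceDriftDetector()
# for x in stream:
#     conf = model.predict_proba(x.reshape(1, -1)).max()
#     if detector.update(conf):
#         ...  # trigger model update / retraining
```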
This data contains the election polls for the 2004, 2008, 2012, and 2016 US presidential election by state including data on undecided voter proportions.
Context of the data sets: The Zooniverse platform (www.zooniverse.org) has successfully built a large community of volunteers contributing to citizen science projects. Galaxy Zoo and the Milky Way Project were hosted there.
WDC Block is a benchmark for comparing the performance of blocking methods that are used as part of entity resolution pipelines.
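For context, a minimal sketch of a key-based blocking step of the kind such a benchmark evaluates is shown below; only record pairs sharing a blocking key become candidate matches. The token-prefix key and record schema are illustrative baselines, not the WDC Block methodology.

```python
from collections import defaultdict
from itertools import combinations

def block_by_key(records, key_fn):
    """Group records by a blocking key; only pairs within a block become candidates."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key_fn(rec)].append(rec)
    for block in blocks.values():
        yield from combinations(block, 2)

# Illustrative records and key (first title token, lowercased) -- not the benchmark's schema.
records = [
    {"id": 1, "title": "Apple iPhone 13 128GB"},
    {"id": 2, "title": "apple iphone 13 (128 GB)"},
    {"id": 3, "title": "Samsung Galaxy S22"},
]
candidates = list(block_by_key(records, key_fn=lambda r: r["title"].split()[0].lower()))
print(candidates)  # only the two iPhone offers are compared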
In our benchmark WHYSHIFT, we explore distribution shifts on 5 real-world tabular datasets from the economic and traffic sectors with natural spatiotemporal distribution shifts. We pick 7 typical settings out of 22 and select one representative target domain for each setting. In our benchmark, we specify the distribution shift pattern for each setting, and we provide tools to identify risky regions with large $Y|X$ shifts and to diagnose the resulting performance degradation.
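One simple way to surface such risky regions, sketched below, is to train a model on the source domain and fit a shallow, interpretable tree to its per-sample errors on the target domain; the leaves of that tree describe regions where performance degrades most. This is an illustrative diagnostic under assumed (X, y) arrays for both domains, not the benchmark's own tooling.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeRegressor, export_text

def find_risky_regions(X_src, y_src, X_tgt, y_tgt, max_depth=2):
    # Train on the source domain, then score each target sample by whether it is misclassified.
    model = GradientBoostingClassifier().fit(X_src, y_src)
    errors = (model.predict(X_tgt) != y_tgt).astype(float)
    # A shallow tree on the per-sample errors yields interpretable regions
    # (feature-threshold rules) where the source-trained model degrades most.
    region_tree = DecisionTreeRegressor(max_depth=max_depth).fit(X_tgt, errors)
    return export_text(region_tree)
```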
Fact-based Text Editing dataset based on WebNLG dataset.
WikiTableSet is a large publicly available image-based table recognition dataset in three languages built from Wikipedia. WikiTableSet contains nearly 4 million English table images, 590K Japanese table images, and 640K French table images with corresponding HTML representations and cell bounding boxes. We build a Wikipedia table extractor, WTabHTML, and use it to extract tables (in HTML format) from the 2022-03-01 Wikipedia dump. In this study, we select Wikipedia tables from three representative languages, i.e., English, Japanese, and French; however, the dataset could be extended to around 300 languages with 17M tables using our table extractor. Second, we normalize the HTML tables following the PubTabNet format (separating table headers and table data, removing CSS and style tags). Finally, we use Chrome and Selenium to render table images from the table HTML code. This dataset provides a standard benchmark for studying table recognition algorithms in different languages or even
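A minimal sketch of the rendering step (HTML table to image via headless Chrome and Selenium) might look like the following; the example HTML, file names, and window size are illustrative assumptions rather than the dataset's actual rendering configuration.

```python
import os
from selenium import webdriver

html = "<table><tr><th>h1</th><th>h2</th></tr><tr><td>a</td><td>b</td></tr></table>"
with open("table.html", "w", encoding="utf-8") as f:
    f.write(html)

options = webdriver.ChromeOptions()
options.add_argument("--headless")              # render without a visible browser window
options.add_argument("--window-size=800,600")

driver = webdriver.Chrome(options=options)
try:
    driver.get("file://" + os.path.abspath("table.html"))
    driver.save_screenshot("table.png")         # the rendered table image
finally:
    driver.quit()
```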
Wikipedia is the largest and most-read free online encyclopedia currently in existence. As such, Wikipedia offers a large amount of data on all its contents and the interactions around them, as well as different types of open data sources. This makes Wikipedia a unique data source that can be analyzed with quantitative data science techniques. However, the enormous amount of data makes it difficult to get an overview, and many of the analytical possibilities that Wikipedia offers remain unknown. To reduce the complexity of identifying and collecting data on Wikipedia and to expand its analytical potential, after collecting data from various sources and processing them, we have generated a dedicated Wikipedia Knowledge Graph aimed at facilitating the analysis and contextualization of the activity and relations of Wikipedia pages, in this case limited to its English edition. We share this Knowledge Graph dataset in an open way, aiming to be useful for a wide range
Natural Vertical Partitioned CVR Dataset for Vertical Federated Learning
[Real or Fake]: Fake Job Description Prediction. This dataset contains 18K job descriptions, of which about 800 are fake. The data consists of both textual information and meta-information about the jobs. The dataset can be used to create classification models that learn which job descriptions are fraudulent.
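A minimal sketch of the kind of classifier this description suggests is shown below: TF-IDF over the description text plus logistic regression. The CSV file name and the "description"/"fraudulent" column names are assumptions about the dataset's schema.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# File and column names are assumptions about how the data is distributed.
df = pd.read_csv("fake_job_postings.csv").fillna({"description": ""})
X_train, X_test, y_train, y_test = train_test_split(
    df["description"], df["fraudulent"],
    test_size=0.2, stratify=df["fraudulent"], random_state=0,
)

clf = make_pipeline(TfidfVectorizer(max_features=50_000), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```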
This dataset provides wireless measurements from two industrial testbeds: iV2V (industrial Vehicle-to-Vehicle) and iV2I+ (industrial Vehicular-to-Infrastructure plus sensor).
The dataset contains the standard contexts of all atomic lattices, in the Concept Explorer format.
Objective: This study introduces the BlendedICU dataset, a large dataset of international intensive care data. The dataset aims to facilitate generalizability studies of machine learning models, as well as statistical studies of clinical practices in intensive care units.
Can you detect fraud from customer transactions? Imagine standing at the check-out counter at the grocery store with a long line behind you and the cashier not-so-quietly announces that your card has been declined. In this moment, you probably aren’t thinking about the data science that determined your fate.
Data Set Name: Rice Dataset (Cammeo and Osmancik). Abstract: A total of 3,810 rice grain images were taken for the two species (Cammeo and Osmancik), processed, and feature inferences were made. 7 morphological features were obtained for each grain of rice.
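A minimal sketch of classifying the two varieties from the 7 morphological features is shown below; the file name and the "Class" label column are assumptions about how the data is distributed.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# File name and label column are assumed, not part of the official distribution.
df = pd.read_csv("rice_cammeo_osmancik.csv")
X = df.drop(columns=["Class"])            # the 7 morphological features per grain
y = df["Class"]                           # Cammeo vs. Osmancik

scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print("5-fold accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```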
We present a real data analysis of a CT experiment that was conducted in Italy for 8 months and involved more than 100,000 CT app users.
Standardized Multi-Channel Dataset for Glaucoma (SMDG-19) is a collection and standardization of 19 public datasets comprising full-fundus glaucoma images, associated image metadata such as optic disc segmentation, optic cup segmentation, and blood vessel segmentation, and any provided per-instance text metadata such as sex and age. This dataset is the largest public repository of fundus images with glaucoma.
The Reddit COVID Dataset is a dataset of 4.51M Reddit posts and 17.8M comments - all mentions of COVID until 2021-10-25 across the entire Reddit social network. Both were procured with SocialGrep's export feature and released as part of SocialGrep Reddit datasets. The posts are labeled with their subreddit, title, creation date, domain, selftext, and score. The comments are labeled with their subreddit, body, creation date, sentiment (calculated for you using a VADER pipeline), and score.
X-Wines is a consistent wine dataset containing 100,646 instances and 21 million real ratings carried out by users. Data were collected from the open Web in 2022 and pre-processed for wider free use. The ratings use a 1–5 scale and were given over a period of 10 years (2012–2021) for wines produced in 62 different countries.