Our SRSD (Feynman) datasets are designed to evaluate the performance of Symbolic Regression for Scientific Discovery (SRSD). We carefully reviewed the properties of each formula and its variables in the Feynman Symbolic Regression Database to design reasonably realistic sampling ranges of values, so that our SRSD datasets can be used to evaluate the potential of SRSD, e.g., whether or not an SR method can (re)discover physical laws from such datasets.
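A minimal sketch of the kind of evaluation this enables is shown below: sample variable values from physically plausible (here, log-uniform) ranges, compute the target of a known formula, and ask an SR method to recover it. The formula choice, the ranges, and the regressor interface are illustrative assumptions, not the SRSD specification.

```python
import numpy as np

# Hypothetical example: Coulomb's law F = q1*q2 / (4*pi*eps0*r^2), sampled from
# log-uniform ranges intended to mimic physically plausible magnitudes.
rng = np.random.default_rng(0)

def log_uniform(low, high, size):
    return np.exp(rng.uniform(np.log(low), np.log(high), size))

n = 10_000
q1 = log_uniform(1e-9, 1e-6, n)   # charge [C]
q2 = log_uniform(1e-9, 1e-6, n)   # charge [C]
r = log_uniform(1e-3, 1e0, n)     # distance [m]
eps0 = 8.8541878128e-12
F = q1 * q2 / (4 * np.pi * eps0 * r**2)

X, y = np.column_stack([q1, q2, r]), F

# An SR method would then be asked to (re)discover the law from (X, y), e.g.:
# model = SomeSymbolicRegressor().fit(X, y)   # placeholder interface
# print(model.expression_)                    # compare against the ground-truth formula
```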
The SNS data (Valente et al., 2013) is a four-wave survey conducted in Los Angeles County, United States, featuring a sample of 1,795 high-school students. The survey collected information about students in grades 10 to 12, a majority of whom self-identified as Hispanic. The collected information includes socio-economic status, demographics, social networks, and substance use (consumption of alcohol, tobacco, and marijuana).
The file contains an annotated list of papers that are included in the literature survey.
Graph Neural Networks (GNNs) have gained traction across domains such as transportation, bioinformatics, language processing, and computer vision. However, there is a noticeable absence of research applying GNNs to supply chain networks. Supply chain networks are inherently graph-like in structure, making them prime candidates for GNN methodologies and opening up possibilities for optimizing, predicting, and solving even the most complex supply chain problems. A major obstacle is the absence of real-world benchmark datasets to facilitate research on supply chain problems using GNNs. To address this, we present a real-world benchmark dataset for temporal tasks, obtained from one of the leading FMCG companies in Bangladesh, focusing on supply chain planning for production purposes. The dataset includes temporal data as node features to enable sales prediction, production planning, and the identification of fact
Please see the paper for the questions. These are the answers to the surveys, processed and included in the paper via knitr.
To explore the nascent area of sustainable venture capital, a review of related research was conducted and social entrepreneurs and investors were interviewed to construct a questionnaire assessing the interests and intentions of current and future ecosystem participants. Analysis of 114 responses received via several sampling methods revealed statistically significant relationships between investing preferences and gender, generation, sophistication, and other variables, down to the level of individual UN Sustainable Development Goals (SDGs).
The ARPA-E-funded TERRA-REF project is generating open-access reference datasets for the study of plant sensing, genomics, and phenomics. Sensor data were generated by a field-scanner sensing platform that captures color, thermal, hyperspectral, and active fluorescence imagery, as well as three-dimensional structure and associated environmental measurements. This dataset is provided alongside data collected using traditional field methods in order to support calibration and validation of algorithms used to extract plot-level phenotypes from these datasets.
The Reddit Climate Change Dataset is a dataset of 620K Reddit posts and 4.6M comments - all mentions of the terms "climate" and "change" until 2022-09-01 across the entire Reddit social network. Both were procured with SocialGrep's export feature and released as part of SocialGrep Reddit datasets. The posts are labeled with their subreddit, title, creation date, domain, selftext, and score. The comments are labeled with their subreddit, body, creation date, sentiment (calculated for you using a VADER pipeline), and score.
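The exact sentiment pipeline is not documented here, but a minimal sketch of a VADER pass over comment bodies, using the vaderSentiment package, might look like the following; the example comments are invented for illustration.

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def compound_sentiment(text):
    # VADER returns a "compound" polarity score in [-1, 1];
    # the dataset's exact post-processing of this score is assumed, not documented.
    return analyzer.polarity_scores(text)["compound"]

comments = [
    "Climate change is a serious problem and we need to act now.",
    "This thread is full of misinformation.",
]
for c in comments:
    print(c, "->", compound_sentiment(c))
```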
AI-based digital twins are at the leading edge of the Industry 4.0 revolution, technologically empowered by the Internet of Things and real-time data analysis. Information collected from industrial assets is produced in a continuous fashion, yielding data streams that must be processed under stringent timing constraints. Such data streams are usually subject to non-stationary phenomena, which can change the data distribution of the streams; the knowledge captured by models used for data analysis may thus become obsolete (the so-called concept drift effect). Early detection of the change (drift) is crucial for updating the model's knowledge, which is especially challenging in scenarios where the ground truth associated with the stream data is not readily available. Among many other techniques, the estimation of the model's confidence has been timidly suggested in a few studies as a criterion for detecting drifts in unsupervised settings. The goal of this m
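A minimal sketch of confidence-based unsupervised drift detection of the kind alluded to above: track the model's mean predicted confidence over a sliding window and flag drift when it drops well below a calibrated reference. The window size, tolerance, and detector interface are illustrative assumptions, not the method from any particular study.

```python
import numpy as np
from collections import deque

class ConfidenceDriftDetector:
    """Flags drift when recent mean confidence drops well below a reference level.
    Window size and tolerance are illustrative choices."""

    def __init__(self, window=200, tolerance=0.10):
        self.window = deque(maxlen=window)
        self.tolerance = tolerance
        self.reference = None

    def update(self, confidence):
        self.window.append(confidence)
        if len(self.window) < self.window.maxlen:
            return False                      # not enough samples yet
        mean_conf = np.mean(self.window)
        if self.reference is None:
            self.reference = mean_conf        # calibrate on the first full window
            return False
        return mean_conf < self.reference - self.tolerance

# Usage with any probabilistic classifier: feed the max class probability per sample.
# detector = ConfidenceDriftDetector()
# for x in stream:
#     conf = model.predict_proba(x.reshape(1, -1)).max()
#     if detector.update(conf):
#         ...  # trigger model update / retraining
```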
This data contains the election polls for the 2004, 2008, 2012, and 2016 US presidential election by state including data on undecided voter proportions.
Context of the data sets: The Zooniverse platform (www.zooniverse.org) has successfully built a large community of volunteers contributing to citizen science projects. Galaxy Zoo and the Milky Way Project were hosted there.
WDC Block is a benchmark for comparing the performance of blocking methods that are used as part of entity resolution pipelines.
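For context, a minimal sketch of a key-based blocking step of the kind such a benchmark evaluates is shown below; only record pairs sharing a blocking key become candidate matches. The token-prefix key and record schema are illustrative baselines, not the WDC Block methodology.

```python
from collections import defaultdict
from itertools import combinations

def block_by_key(records, key_fn):
    """Group records by a blocking key; only pairs within a block become candidates."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key_fn(rec)].append(rec)
    for block in blocks.values():
        yield from combinations(block, 2)

# Illustrative records and key (first title token, lowercased) -- not the benchmark's schema.
records = [
    {"id": 1, "title": "Apple iPhone 13 128GB"},
    {"id": 2, "title": "apple iphone 13 (128 GB)"},
    {"id": 3, "title": "Samsung Galaxy S22"},
]
candidates = list(block_by_key(records, key_fn=lambda r: r["title"].split()[0].lower()))
print(candidates)  # only the two iPhone offers are compared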
In our benchmark WHYSHIFT, we explore distribution shifts on 5 real-world tabular datasets from the economic and traffic sectors with natural spatiotemporal distribution shifts. We pick 7 typical settings out of 22 and select one representative target domain for each setting. In our benchmark, we specify the distribution shift pattern for each setting, and we provide tools to identify risky regions with large $Y|X$ shifts and to diagnose the resulting performance degradation.
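One simple way to surface such risky regions, sketched below, is to train a model on the source domain and fit a shallow, interpretable tree to its per-sample errors on the target domain; the leaves of that tree describe regions where performance degrades most. This is an illustrative diagnostic under assumed (X, y) arrays for both domains, not the benchmark's own tooling.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeRegressor, export_text

def find_risky_regions(X_src, y_src, X_tgt, y_tgt, max_depth=2):
    # Train on the source domain, then score each target sample by whether it is misclassified.
    model = GradientBoostingClassifier().fit(X_src, y_src)
    errors = (model.predict(X_tgt) != y_tgt).astype(float)
    # A shallow tree on the per-sample errors yields interpretable regions
    # (feature-threshold rules) where the source-trained model degrades most.
    region_tree = DecisionTreeRegressor(max_depth=max_depth).fit(X_tgt, errors)
    return export_text(region_tree)
```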
Fact-based Text Editing dataset based on WebNLG dataset.
WikiTableSet is a large publicly available image-based table recognition dataset in three languages built from Wikipedia. WikiTableSet contains nearly 4 million English table images, 590K Japanese table images, and 640K French table images with corresponding HTML representations and cell bounding boxes. We build a Wikipedia table extractor, WTabHTML, and use it to extract tables (in HTML format) from the 2022-03-01 Wikipedia dump. In this study, we select Wikipedia tables from three representative languages, i.e., English, Japanese, and French; however, the dataset could be extended to around 300 languages with 17M tables using our table extractor. Second, we normalize the HTML tables following the PubTabNet format (separating table headers and table data, removing CSS and style tags). Finally, we use Chrome and Selenium to render table images from the table HTML code. This dataset provides a standard benchmark for studying table recognition algorithms in different languages or even
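A minimal sketch of the rendering step (HTML table to image via headless Chrome and Selenium) might look like the following; the example HTML, file names, and window size are illustrative assumptions rather than the dataset's actual rendering configuration.

```python
import os
from selenium import webdriver

html = "<table><tr><th>h1</th><th>h2</th></tr><tr><td>a</td><td>b</td></tr></table>"
with open("table.html", "w", encoding="utf-8") as f:
    f.write(html)

options = webdriver.ChromeOptions()
options.add_argument("--headless")              # render without a visible browser window
options.add_argument("--window-size=800,600")

driver = webdriver.Chrome(options=options)
try:
    driver.get("file://" + os.path.abspath("table.html"))
    driver.save_screenshot("table.png")         # the rendered table image
finally:
    driver.quit()
```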
Wikipedia is the largest and most-read free online encyclopedia currently in existence. As such, Wikipedia offers a large amount of data on all its contents and the interactions around them, as well as different types of open data sources. This makes Wikipedia a unique data source that can be analyzed with quantitative data science techniques. However, the enormous amount of data makes it difficult to get an overview, and many of the analytical possibilities that Wikipedia offers remain unknown. To reduce the complexity of identifying and collecting data on Wikipedia and to expand its analytical potential, after collecting data from various sources and processing them, we have generated a dedicated Wikipedia Knowledge Graph aimed at facilitating the analysis and contextualization of the activity and relations of Wikipedia pages, in this case limited to its English edition. We share this Knowledge Graph dataset in an open way, aiming to be useful for a wide range
Natural Vertical Partitioned CVR Dataset for Vertical Federated Learning
[Real or Fake]: Fake Job Description Prediction. This dataset contains 18K job descriptions, of which about 800 are fake. The data consists of both textual information and meta-information about the jobs. The dataset can be used to create classification models that learn which job descriptions are fraudulent.
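A minimal sketch of the kind of classifier this description suggests is shown below: TF-IDF over the description text plus logistic regression. The CSV file name and the "description"/"fraudulent" column names are assumptions about the dataset's schema.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# File and column names are assumptions about how the data is distributed.
df = pd.read_csv("fake_job_postings.csv").fillna({"description": ""})
X_train, X_test, y_train, y_test = train_test_split(
    df["description"], df["fraudulent"],
    test_size=0.2, stratify=df["fraudulent"], random_state=0,
)

clf = make_pipeline(TfidfVectorizer(max_features=50_000), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```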
This dataset provides wireless measurements from two industrial testbeds: iV2V (industrial Vehicle-to-Vehicle) and iV2I+ (industrial Vehicular-to-Infrastructure plus sensor).
The dataset contains the standard contexts of all atomic lattices, in the Concept Explorer format.
Objective: This study introduces the BlendedICU dataset, a large dataset of international intensive care data. The dataset aims to facilitate generalizability studies of machine learning models, as well as statistical studies of clinical practices in intensive care units.
Can you detect fraud from customer transactions? Imagine standing at the check-out counter at the grocery store with a long line behind you and the cashier not-so-quietly announces that your card has been declined. In this moment, you probably aren’t thinking about the data science that determined your fate.
Data Set Name: Rice Dataset (Cammeo and Osmancik). Abstract: A total of 3,810 rice grain images were taken for the two species (Cammeo and Osmancik), processed, and feature inferences were made. 7 morphological features were obtained for each grain of rice.
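A minimal sketch of classifying the two varieties from the 7 morphological features is shown below; the file name and the "Class" label column are assumptions about how the data is distributed.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# File name and label column are assumed, not part of the official distribution.
df = pd.read_csv("rice_cammeo_osmancik.csv")
X = df.drop(columns=["Class"])            # the 7 morphological features per grain
y = df["Class"]                           # Cammeo vs. Osmancik

scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print("5-fold accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```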
We present a real data analysis of a CT experiment that was conducted in Italy for 8 months and involved more than 100,000 CT app users.
Standardized Multi-Channel Dataset for Glaucoma (SMDG-19) is a collection and standardization of 19 public datasets comprising full-fundus glaucoma images, associated image metadata such as optic disc segmentation, optic cup segmentation, and blood vessel segmentation, and any provided per-instance text metadata such as sex and age. This dataset is the largest public repository of fundus images with glaucoma.
The Reddit COVID Dataset is a dataset of 4.51M Reddit posts and 17.8M comments - all mentions of COVID until 2021-10-25 across the entire Reddit social network. Both were procured with SocialGrep's export feature and released as part of SocialGrep Reddit datasets. The posts are labeled with their subreddit, title, creation date, domain, selftext, and score. The comments are labeled with their subreddit, body, creation date, sentiment (calculated for you using a VADER pipeline), and score.
X-Wines is a consistent wine dataset containing 100,646 instances and 21 million real ratings carried out by users. Data were collected from the open Web in 2022 and pre-processed for wider free use. The ratings use a 1–5 scale and were given over a period of 10 years (2012–2021) for wines produced in 62 different countries.