Dataset introduced by Xifeng Yan et al.
1 PAPER • NO BENCHMARKS YET
HAM is a dataset for molecular graph partitioning. It contains coarse-grained (CG) mappings of 1206 organic molecules with fewer than 25 heavy atoms. Each molecule was downloaded from the PubChem database as a SMILES string and hand-mapped. Each molecule was assigned to two annotators so that human agreement between CG mappings could be compared. The completed annotations were reviewed by a third person to identify and remove unreasonable mappings (e.g., one-bead mappings) that did not follow the given guidelines. Hence, the current database has 1.68 annotations per molecule on average (16% of the original two per molecule were removed).
Hypertension Disease Medication dataset.
Multi-Modal Hate Speech Detection with Graph Context.
The dataset contains multi-modal features (visual and textual), pseudo-labels (on heritage values and attributes), and graph structures (with temporal, social, and spatial links) constructed from User-Generated Content collected from the Flickr social media platform in three global cities containing UNESCO World Heritage properties (Amsterdam, Suzhou, Venice). The data collection aims to provide datasets that are both directly applicable as a test-bed for ML communities and theoretically informative for heritage and urban scholars drawing conclusions for planning decision-making.
HoaxItaly consists of over 1 million tweets shared during 2019 and containing links to thousands of news articles published on two classes of Italian outlets: (1) disinformation websites, i.e. outlets which have been repeatedly flagged by journalists and fact-checkers for producing low-credibility content such as false news, hoaxes, click-bait, misleading and hyper-partisan stories; (2) fact-checking websites which notably debunk and verify online news and claims. The dataset includes title and body for approximately 37k news articles.
A large dataset from the Inductive Link Prediction Challenge 2022. Training graph contains 46K entities, 130 relations, 202K triples. Inference graph contains 30K entities, 130 relations, 77K triples. Validation and test triples to predict belong to the inference graph.
1 PAPER • 1 BENCHMARK
A small dataset from the Inductive Link Prediction Challenge 2022. Training graph contains 10K entities, 96 relations, 78K triples. Inference graph contains 7K entities, 96 relations, 21K triples. Validation and test triples to predict belong to the inference graph.
IMCPT-SparseGM is a new visual graph matching benchmark addressing partial matching and graphs with larger sizes, based on the novel stereo benchmark Image Matching Challenge PhotoTourism (IMC-PT) 2020. The dataset was released with the CVPR 2023 paper "Deep Learning of Partial Graph Matching via Differentiable Top-K".
We release 280 synthetic IAM graphs generated using IAM graphs of commercial companies. Specifically, we vary the number of nodes but keep the graph density as is, i.e. in the range of 0.259 ± 0.198 (avg ± std). To generate a synthetic graph, we first sample the number of users and datastores from uniform distributions over the intervals [10, 150] and [50, 300] respectively, which cover the variation of those parameters across real graphs. After fixing the node counts, we sample the actual nodes with replacement from a real-world graph chosen at random. Then we add Gaussian N(0, 0.01) noise to the node embeddings and renormalize them. To match the graph density with the density of the underlying baseline, we sample edges from a multinomial distribution, where each component is proportional to the cosine distance between a user embedding and a datastore embedding. We also enforce the invariant that dynamic edges are always a subset of all permission edges. A synthetic graph generated in such
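The sampling steps described for the synthetic IAM graphs can be sketched roughly as follows. This is a minimal illustration and not the authors' generator: the function name, the argument layout, treating 0.01 as the noise variance, defining density as edges over all user-datastore pairs, and drawing distinct edges without replacement are all assumptions made here. The dynamic-edge invariant is not covered.

```python
import numpy as np

def synth_iam_graph(real_users, real_stores, density, rng):
    """Hypothetical sketch: build one synthetic user-datastore graph."""
    # Node counts drawn uniformly from the stated intervals (inclusive).
    n_users = int(rng.integers(10, 151))
    n_stores = int(rng.integers(50, 301))

    # Sample node embeddings with replacement from one real graph.
    users = real_users[rng.integers(0, len(real_users), size=n_users)]
    stores = real_stores[rng.integers(0, len(real_stores), size=n_stores)]

    # Add Gaussian noise, then renormalize each embedding to unit length.
    # Assumption: 0.01 in N(0, 0.01) is the variance, so std = 0.1.
    def perturb(x):
        x = x + rng.normal(0.0, 0.1, size=x.shape)
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    users, stores = perturb(users), perturb(stores)

    # Edge probabilities proportional to cosine distance, as the text states.
    dist = np.clip(1.0 - users @ stores.T, 1e-12, None)
    probs = (dist / dist.sum()).ravel()

    # Assumption: density = edges / (n_users * n_stores); edges are drawn
    # without replacement so the target density is matched exactly.
    n_edges = int(round(density * n_users * n_stores))
    picked = rng.choice(probs.size, size=n_edges, replace=False, p=probs)
    adj = np.zeros(probs.size, dtype=bool)
    adj[picked] = True
    return users, stores, adj.reshape(n_users, n_stores)

# Usage with random stand-ins for real embeddings.
rng = np.random.default_rng(0)
demo_users = rng.normal(size=(20, 8))
demo_stores = rng.normal(size=(30, 8))
u, s, adj = synth_iam_graph(demo_users, demo_stores, 0.259, rng)
print(adj.shape, round(float(adj.mean()), 3))
```

A strict multinomial draw would allow the same user-datastore pair to be picked more than once; sampling without replacement is a simplification chosen here so that the edge count maps directly to the density.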
JoCAD is a dataset for anomaly detection in citation networks.
The KACC benchmark consists of three subtasks that can be applied to knowledge graphs: knowledge abstraction, knowledge concretization and knowledge completion.
KGRC-RDF-star is an RDF-star dataset converted from KGRC-RDF, which is a Knowledge graph dataset of novel stories.
The LSEC (Live Stream E-Commerce) dataset has two subsets: LSEC-Small and LSEC-Large. It is a dataset for studying e-commerce transactions in the context of live streams, where streamers talk about products while interacting with their audience. The dataset consists of interaction information among streamers, users, and products.
An RDF knowledge graph that provides comprehensive, current information about almost 400,000 machine learning publications. This includes the tasks addressed, the datasets utilized, the methods implemented, and the evaluations conducted, along with their results. Compared to its non-RDF-based counterpart Papers With Code, LPWC not only translates the latest advancements in machine learning into RDF format, but also enables novel ways for scientific impact quantification and scholarly key content recommendation. LPWC is openly accessible and is licensed under CC-BY-SA 4.0. As a knowledge graph in the Linked Open Data cloud, we offer LPWC in multiple formats, from RDF dump files to a SPARQL endpoint for direct web queries, as well as a data source with resolvable URIs and links to the data sources SemOpenAlex, Wikidata, and DBLP. Additionally, we supply knowledge graph embeddings, enabling LPWC to be readily applied in machine learning applications.
The set is created using molecule SMILES retrieved from the database PubChem. Images are then generated from SMILES using the molecule drawing library RDKit. The synthetic set is augmented at multiple levels:
The dataset contains entities from IMDB, TheMovieDB and TheTVDB with gold-standard matches between the sources. Due to the licensing of IMDB, we provide a script so that you can build the IMDB part of the dataset yourself.
This is the large version of the MuMiN dataset.
This is the medium version of the MuMiN dataset.
This is the small version of the MuMiN dataset.
This dataset contains information on application install interactions of users in the Myket Android application market. The dataset was created for evaluating interaction prediction models, which require user and item identifiers along with timestamps of the interactions. Hence, the dataset can be used for interaction prediction and for building recommendation systems. Furthermore, the data forms a dynamic network of interactions, so network representation learning can also be performed on its nodes, which are users and applications.
The OU-ISIR Gait Database, Multi-View Large Population Database with Pose Sequence (OUMVLP-Pose) is meant to aid research efforts in the general area of developing, testing and evaluating algorithms for model-based gait recognition.
This is the set of graphs used in the Heuristic track of the PACE 2022 challenge for computing the Directed Feedback Vertex Set. It consists of 200 labelled directed graphs. The graphs are mostly not symmetric (an edge from u->v does not imply an edge from v->u), although some are symmetric. The graph labels are integers ranging from 1 to N.
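The symmetry property mentioned for the PACE 2022 graphs can be checked with a few lines of code. The sketch below assumes the graph has already been loaded into a list of `(u, v)` integer pairs; the helper name `is_symmetric` is hypothetical and nothing here reflects the challenge's file format.

```python
def is_symmetric(edges):
    """Return True if every directed edge (u, v) has a reverse edge (v, u)."""
    edge_set = set(edges)
    return all((v, u) in edge_set for (u, v) in edge_set)

# A directed 3-cycle is not symmetric; a bidirected path is.
cycle = [(1, 2), (2, 3), (3, 1)]
bidirected = [(1, 2), (2, 1), (2, 3), (3, 2)]
print(is_symmetric(cycle), is_symmetric(bidirected))  # prints: False True
```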
The PART-OF dataset is a dataset of relations extracted from a medical ontology. The different entities in the ontology are parts of the human body. The dataset has 16,894 nodes with 19,436 edges between them.
ReviewRobot Dataset Overview: this repository contains data for the paper "ReviewRobot: Explainable Paper Review Generation based on Knowledge Synthesis".
The Room environment - v0. There is a newer version, v1.
The Room environment - v1. For the documentation of RoomEnv-v0, click the corresponding buttons.
SidechainNet is a protein structure prediction dataset that directly extends ProteinNet. Specifically, SidechainNet adds measurements for protein angles and coordinates that describe the complete, all-atom protein structure (backbone and sidechain, excluding hydrogens) instead of the protein backbone alone.
FLORIS farm dataset: a dataset for graph neural network modeling of wind farms. The current version of the dataset contains two farms with very different geometry but similar inter-turbine statistics. The wind farms were simulated with the steady-state wake model FLORIS.
Graph Neural Networks (GNNs) have gained traction across different domains such as transportation, bio-informatics, language processing, and computer vision. However, there is a noticeable absence of research on applying GNNs to supply chain networks. Supply chain networks are inherently graph-like in structure, making them prime candidates for GNN methodologies. This opens up possibilities for optimizing, predicting, and solving even the most complex supply chain problems. A major obstacle is the absence of real-world benchmark datasets to support research on solving supply chain problems with GNNs. To address this issue, we present a real-world benchmark dataset for temporal tasks, obtained from one of the leading FMCG companies in Bangladesh, focusing on supply chain planning for production purposes. The dataset includes temporal data as node features to enable sales predictions, production planning, and the identification of fact
This dataset accompanies the paper "Learning the mechanisms of network growth" by the same authors. The dataset contains 6733 networks of size 20,000, each generated according to a different combination of three mechanisms: fitness, aging and preferential attachment. The goal is to use machine learning to identify the combination of mechanisms that was used to create the network. The dataset includes static features from the literature and two versions of our newly developed dynamic features. net
TextWorld KG is a dynamic Knowledge Graph (KG) extraction dataset. It is based on a set of text-based games generated using. That framework allows extracting the underlying partial KG for every state, i.e., the subgraph that represents the agent's partial knowledge of the world: what it has observed so far. All games share the same overarching theme: the agent finds itself hungry in a simple modern house with the goal of gathering ingredients and cooking a meal.
We introduce USPTO-30K, a large-scale benchmark dataset of annotated molecule images, which overcomes these limitations. It is created using pairs of images and MolFiles from the United States Patent and Trademark Office. Each molecule was independently selected among all the available documents from 2001 to 2020. The set consists of three subsets to decouple the study of clean molecules, molecules with abbreviations, and large molecules.
The raw data were obtained from an industrial plant for ultra-processed food production. Samples were taken every 5 minutes, while the total production cycle takes approximately 3 hours, from raw ingredients to final semi-finished products. The extracted data represent approximately 80 days of production. Variables 2-14 belong to 4 specific phases of the process and influence the qualitative variable 17. Variables 15 and 16 are external variables that are not controlled by the process and that affect the final product. Some variation may be due to changes in raw materials, in production flow (variable 1), or to possible reconfiguration between weeks. However, while the magnitude of effects may change between weeks, the causal relationships are dictated by the plant and process dynamics and are consistent (barring potential unobserved confounders and faults) throughout the production.
YoutubeGraph-Dyn is an evolving graph dataset generated from YouTube real-world interactions. It can be used to study temporal evolution on graphs. YoutubeGraph-Dyn provides intra-day time granularity (with 416 snapshots taken every 6 hours for a period of 104 days), multi-modal relationships that capture different aspects of the data, multiple attributes including timestamped, non-timestamped, word embeddings, and integers.
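The snapshot count quoted for YoutubeGraph-Dyn follows directly from the sampling interval: 104 days at one snapshot every 6 hours gives 104 × 4 = 416 snapshots. A small sketch of that arithmetic, with a placeholder start date that is purely an assumption (the description does not state the collection dates):

```python
from datetime import datetime, timedelta

step = timedelta(hours=6)
period = timedelta(days=104)

# Number of 6-hour snapshots that fit in the 104-day period.
n_snapshots = period // step
print(n_snapshots)  # prints: 416

# Hypothetical start date, used only to illustrate the snapshot grid.
start = datetime(2000, 1, 1)
timestamps = [start + i * step for i in range(n_snapshots)]
```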
ZeroKBC is a comprehensive benchmark that covers all scenarios of the zero-shot Knowledge Base Completion (KBC) task. It has 3 zero-shot scenarios with 8 fine-grained settings.
This file contains the data and code for the publication "The Federal Reserve's Response to the Global Financial Crisis and Its Long-Term Impact: An Interrupted Time-Series Natural Experimental Analysis" by A. C. Kamkoum, 2023.
This is the list of all doges of the Venetian Republic, as well as their wives where there is a record that they existed. Entries include name, surname if known, and the dates of their office, as well as the dates of their weddings. Data has been extracted from Wikipedia, with some errors fixed by checking against other sources.
hERG is a large-scale biophysics federated molecular dataset related to cardiac toxicity. It consists of 10,572 compounds, with an average of 29.39 nodes and 94.09 edges in each graph.
This dataset is composed of paired videos of people dancing to 3 different music styles: Ballet, Michael Jackson and Salsa. It contains multimodal data (visual data, temporal graphs and audio) carefully selected from publicly available videos of dancers performing representative movements of each music style, together with audio data from the respective styles.
pmuBAGE (the Benchmarking Assortment of Generated PMU Events) is a dataset that consists of almost 1000 instances of labeled event data to encourage benchmark evaluations on phasor measurement unit (PMU) data analytics. PMU data are challenging to obtain, especially those covering event periods. Nevertheless, power system problems have recently seen phenomenal advancements via data-driven machine learning solutions. A highly accessible standard benchmarking dataset would enable a drastic acceleration of the development of successful machine learning techniques in this field.
The AIDS Antiviral Screen dataset is a dataset of screens checking tens of thousands of compounds for evidence of anti-HIV activity. The available screen results are chemical graph-structured data of these various compounds.
0 PAPER • NO BENCHMARKS YET
A fundamental component of human vision is our ability to parse complex visual scenes and judge the relations between their constituent objects. AI benchmarks for visual reasoning have driven rapid progress in recent years with state-of-the-art systems now reaching human accuracy on some of these benchmarks. Yet, there remains a major gap between humans and AI systems in terms of the sample efficiency with which they learn new visual reasoning tasks. Humans' remarkable efficiency at learning has been at least partially attributed to their ability to harness compositionality -- allowing them to efficiently take advantage of previously gained knowledge when learning new tasks. Here, we introduce a novel visual reasoning benchmark, Compositional Visual Relations (CVR), to drive progress towards the development of more data-efficient learning algorithms. We take inspiration from fluidic intelligence and non-verbal reasoning tests and describe a novel method for creating compositions of abs
DPB-5L is a multilingual KG dataset containing 5 KGs in English, French, Japanese, Greek, and Spanish. The dataset is used for the Knowledge Graph Completion and Entity Alignment tasks. DPB-5L (Japanese) is the subset of DPB-5L containing the Japanese KG.