The PubMed dataset consists of 19717 scientific publications from the PubMed database pertaining to diabetes, each classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary of 500 unique words.
1,094 PAPERS • 24 BENCHMARKS
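As a rough illustration of the TF/IDF weighted word vectors mentioned for PubMed, the sketch below computes one vector per document over a fixed dictionary. The exact weighting variant used to build the dataset is not stated here, so the formula (term frequency times log inverse document frequency), the function name `tfidf_vectors`, and the toy documents are all assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs, vocabulary):
    """Return one TF-IDF vector per tokenized document, ordered like `vocabulary`.

    Assumed weighting: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vec = []
        for word in vocabulary:
            tf = counts[word] / len(doc) if doc else 0.0
            idf = math.log(n_docs / df[word]) if df[word] else 0.0
            vec.append(tf * idf)
        vectors.append(vec)
    return vectors

# Toy corpus (hypothetical); the real dictionary has 500 words.
docs = [
    ["insulin", "glucose", "insulin"],
    ["glucose", "diet"],
    ["insulin", "diet", "diet"],
]
vocabulary = ["diet", "glucose", "insulin"]
vectors = tfidf_vectors(docs, vocabulary)
```

Each row of `vectors` plays the role of a node feature vector; words absent from a document get weight zero.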
The Open Graph Benchmark (OGB) is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. OGB datasets are automatically downloaded, processed, and split using the OGB Data Loader. The model performance can be evaluated using the OGB Evaluator in a unified manner. OGB is a community-driven initiative in active development.
847 PAPERS • 16 BENCHMARKS
The Reddit dataset is a graph dataset from Reddit posts made in the month of September, 2014. The node label in this case is the community, or “subreddit”, that a post belongs to. 50 large communities have been sampled to build a post-to-post graph, connecting posts if the same user comments on both. In total this dataset contains 232,965 posts with an average degree of 492. The first 20 days are used for training and the remaining days for testing (with 30% used for validation). For features, off-the-shelf 300-dimensional GloVe CommonCrawl word vectors are used.
603 PAPERS • 13 BENCHMARKS
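The post-to-post construction described for the Reddit dataset (two posts are linked when the same user comments on both) can be sketched as follows. The input format, a list of `(user, post_id)` pairs, and the function name `post_graph` are hypothetical choices for illustration, not the dataset's actual preprocessing code.

```python
from collections import defaultdict
from itertools import combinations

def post_graph(comments):
    """Build undirected post-to-post edges from (user, post_id) comment pairs."""
    posts_by_user = defaultdict(set)
    for user, post in comments:
        posts_by_user[user].add(post)
    edges = set()
    for posts in posts_by_user.values():
        # Every pair of posts sharing a commenter becomes one undirected edge.
        for a, b in combinations(sorted(posts), 2):
            edges.add((a, b))
    return edges

# Toy comment log (hypothetical).
comments = [
    ("alice", "p1"), ("alice", "p2"),
    ("bob", "p2"), ("bob", "p3"),
    ("carol", "p4"),
]
edges = post_graph(comments)
```

Because a single active user links every pair of posts they commented on, this rule produces dense graphs, which fits the high average degree reported for the dataset.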
The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words.
510 PAPERS • 20 BENCHMARKS
The CiteSeer dataset consists of 3312 scientific publications classified into one of six classes. The citation network consists of 4732 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 3703 unique words.
327 PAPERS • 14 BENCHMARKS
PROTEINS is a dataset of proteins that are classified as enzymes or non-enzymes. Nodes represent the amino acids and two nodes are connected by an edge if they are less than 6 Angstroms apart.
326 PAPERS • 1 BENCHMARK
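The PROTEINS edge rule (two amino acids are connected if they are less than 6 Angstroms apart) amounts to a distance threshold on 3D positions. The sketch below assumes each amino acid is represented by a single coordinate triple in Angstroms; which point of the residue is used is not specified here, and `contact_edges` is a hypothetical helper name.

```python
import math
from itertools import combinations

def contact_edges(coords, cutoff=6.0):
    """Return index pairs (i, j), i < j, whose points are closer than `cutoff`."""
    edges = []
    for (i, a), (j, b) in combinations(enumerate(coords), 2):
        if math.dist(a, b) < cutoff:
            edges.append((i, j))
    return edges

# Three made-up residue positions in Angstroms.
coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (10.0, 0.0, 0.0)]
edges = contact_edges(coords)
```

Here only the first two points (5 Angstroms apart) fall under the cutoff, so the graph has a single edge.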
IMDB-BINARY is a movie collaboration dataset that consists of the ego-networks of 1,000 actors/actresses who played roles in movies in IMDB. In each graph, nodes represent actors/actresses, and there is an edge between two of them if they appear in the same movie. These graphs are derived from the Action and Romance genres.
288 PAPERS • 2 BENCHMARKS
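An ego-network of the kind used in IMDB-BINARY can be sketched from a list of movie casts: the nodes are the chosen actor plus everyone who shares a movie with them, and an edge joins any two of those nodes that appear in the same movie. The cast data and the function name `ego_network` are hypothetical, and the dataset's actual extraction procedure may differ in detail.

```python
from itertools import combinations

def ego_network(casts, ego):
    """Return (nodes, edges) of the ego-network of `ego` given a list of cast sets."""
    # Nodes: the ego plus every co-star from a movie the ego appears in.
    nodes = {ego}
    for cast in casts:
        if ego in cast:
            nodes.update(cast)
    # Edges: any two nodes of the ego-network that share a movie.
    edges = set()
    for cast in casts:
        members = sorted(nodes.intersection(cast))
        edges.update(combinations(members, 2))
    return nodes, edges

# Toy cast lists (hypothetical).
casts = [{"A", "B"}, {"B", "C"}, {"A", "C", "D"}, {"E", "F"}]
nodes, edges = ego_network(casts, "A")
```

Note that B and C are connected through a movie the ego did not appear in; edges among collaborators are kept, not just edges to the ego.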
MUTAG is a collection of nitroaromatic compounds, and the goal is to predict their mutagenicity on Salmonella typhimurium. Input graphs are used to represent chemical compounds, where vertices stand for atoms and are labeled by the atom type (represented by one-hot encoding), while edges between vertices represent bonds between the corresponding atoms. It includes 188 samples of chemical compounds with 7 discrete node labels.
249 PAPERS • 3 BENCHMARKS
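The one-hot encoding of atom types used for MUTAG node labels can be illustrated with a small sketch. The list of seven atom symbols and its ordering are assumptions made for the example, as is the toy molecule fragment; only the count of 7 discrete labels comes from the description above.

```python
def one_hot(label, labels):
    """Return a 0/1 vector with a single 1 at the position of `label` in `labels`."""
    vec = [0] * len(labels)
    vec[labels.index(label)] = 1
    return vec

# Assumed label set and ordering, for illustration only.
ATOM_TYPES = ["C", "N", "O", "F", "I", "Cl", "Br"]

# Hypothetical fragment: atoms are vertices, bonds are edges between atom indices.
atoms = ["C", "C", "N", "O"]
bonds = [(0, 1), (1, 2), (2, 3)]

features = [one_hot(atom, ATOM_TYPES) for atom in atoms]
```

Each row of `features` is the input vector for one vertex, and `bonds` gives the edge list of the same graph.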
COLLAB is a scientific collaboration dataset. A graph corresponds to a researcher’s ego network, i.e., the researcher and their collaborators are nodes and an edge indicates collaboration between two researchers. A researcher’s ego network has one of three possible labels, i.e., High Energy Physics, Condensed Matter Physics, and Astro Physics, which are the fields that the researcher belongs to. The dataset has 5,000 graphs and each graph has label 0, 1, or 2.
234 PAPERS • 2 BENCHMARKS
The NCI1 dataset comes from the cheminformatics domain, where each input graph is used as a representation of a chemical compound: each vertex stands for an atom of the molecule, and edges between vertices represent bonds between atoms. This dataset relates to anti-cancer screens in which the chemicals are assessed as positive or negative against non-small cell lung cancer. Each vertex has an input label representing the corresponding atom type, encoded by a one-hot encoding scheme into a vector of 0/1 elements.
230 PAPERS • 2 BENCHMARKS
IMDB-MULTI is a relational dataset that consists of a network of 1000 actors or actresses who played roles in movies in IMDB. A node represents an actor or actress, and an edge connects two nodes when they appear in the same movie. In IMDB-MULTI, the edges are collected from three different genres: Comedy, Romance and Sci-Fi.
229 PAPERS • 2 BENCHMARKS
ENZYMES is a dataset of 600 protein tertiary structures obtained from the BRENDA enzyme database. The ENZYMES dataset contains 6 enzyme classes.
174 PAPERS • 1 BENCHMARK
REDDIT-BINARY consists of graphs corresponding to online discussions on Reddit. In each graph, nodes represent users, and there is an edge between two users if at least one of them responds to the other’s comment. There are four popular subreddits, namely, IAmA, AskReddit, TrollXChromosomes, and atheism. IAmA and AskReddit are two question/answer-based subreddits, and TrollXChromosomes and atheism are two discussion-based subreddits. A graph is labeled according to whether it belongs to a question/answer-based community or a discussion-based community.
139 PAPERS • 2 BENCHMARKS
PTC is a collection of 344 chemical compounds represented as graphs, labeled by their carcinogenicity for rats. Each node carries one of 19 discrete node labels.
103 PAPERS • 1 BENCHMARK
Reddit-5K is a relational dataset extracted from Reddit.
75 PAPERS • 1 BENCHMARK
TUDataset: A collection of benchmark datasets for learning with graphs
71 PAPERS • 1 BENCHMARK
AIDS is a graph dataset. It consists of 2000 graphs representing molecular compounds which are constructed from the AIDS Antiviral Screen Database of Active Compounds. It contains 4395 chemical compounds, of which 423 belong to class CA, 1081 to CM, and the remaining compounds to CI.
54 PAPERS • 1 BENCHMARK
The Long Range Graph Benchmark (LRGB) is a collection of 5 graph learning datasets that arguably require long-range reasoning to achieve strong performance in a given task. The 5 datasets in this benchmark can be used to prototype new models that can capture long range dependencies in graphs.
51 PAPERS • 5 BENCHMARKS
CSL is a synthetic dataset introduced in Murphy et al. (2019) to test the expressivity of GNNs. In particular, graphs with the same skip-link length are isomorphic, and the task is to classify non-isomorphic graphs.
29 PAPERS • 2 BENCHMARKS
Reddit12k contains 11929 graphs, each corresponding to an online discussion thread where nodes represent users, and an edge indicates that one of the two users responded to a comment by the other. Each of these 11929 discussion graphs has one of 11 graph labels, representing the category of the community.
24 PAPERS • NO BENCHMARKS YET
The LINUX dataset consists of 48,747 Program Dependence Graphs (PDGs) generated from the Linux kernel. Each graph represents a function, where a node represents one statement and an edge represents the dependency between two statements.
13 PAPERS • NO BENCHMARKS YET
MalNet is a large public graph database, representing a large-scale ontology of software function call graphs. MalNet contains over 1.2 million graphs, averaging over 17k nodes and 39k edges per graph, across a hierarchy of 47 types and 696 families.
13 PAPERS • 4 BENCHMARKS
Mutagenicity is a chemical compound dataset of drugs, which can be categorized into two classes: mutagen and non-mutagen.
11 PAPERS • 1 BENCHMARK
For benchmarking, please refer to its variants UPFD-POL and UPFD-GOS.
8 PAPERS • 2 BENCHMARKS
The Gossipcop variant of the UPFD dataset for benchmarking.
3 PAPERS • 1 BENCHMARK
The PolitiFact variant of the UPFD dataset for benchmarking.
2 PAPERS • 1 BENCHMARK
Synthetic graph classification datasets with the task of recognizing the connectivity of same-colored nodes in 4 graphs of varying topology.
1 PAPER • NO BENCHMARKS YET
This dataset accompanies the paper "Learning the mechanisms of network growth" by the same authors. The dataset contains 6733 networks of size 20,000, each generated according to a different combination of three mechanisms: fitness, aging, and preferential attachment. The goal is to use machine learning to identify the combination of mechanisms that was used to create the network. The dataset includes static features from the literature and two versions of our newly developed dynamic features.
1 PAPER • 1 BENCHMARK
The AIDS Antiviral Screen dataset is a dataset of screens checking tens of thousands of compounds for evidence of anti-HIV activity. The available screen results are chemical graph-structured data of these various compounds.
0 PAPER • NO BENCHMARKS YET