🔔 Share your dataset with the ML community!

Filter by Modality (clear)

Filter by Task (clear)

Filter by Language

54 dataset results for Node Classification AND Graphs

The PubMed dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words.

1,093 PAPERS • 24 BENCHMARKS

OGB (Open Graph Benchmark)

The Open Graph Benchmark (OGB) is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. OGB datasets are automatically downloaded, processed, and split using the OGB Data Loader. The model performance can be evaluated using the OGB Evaluator in a unified manner. OGB is a community-driven initiative in active development.

846 PAPERS • 16 BENCHMARKS

The Reddit dataset is a graph dataset from Reddit posts made in the month of September, 2014. The node label in this case is the community, or “subreddit”, that a post belongs to. 50 large communities have been sampled to build a post-to-post graph, connecting posts if the same user comments on both. In total this dataset contains 232,965 posts with an average degree of 492. The first 20 days are used for training and the remaining days for testing (with 30% used for validation). For features, off-the-shelf 300-dimensional GloVe CommonCrawl word vectors are used.

603 PAPERS • 13 BENCHMARKS

Cora

The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words.

510 PAPERS • 20 BENCHMARKS

Citeseer

The CiteSeer dataset consists of 3312 scientific publications classified into one of six classes. The citation network consists of 4732 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 3703 unique words.

327 PAPERS • 14 BENCHMARKS

MUTAG

In particular, MUTAG is a collection of nitroaromatic compounds and the goal is to predict their mutagenicity on Salmonella typhimurium. Input graphs are used to represent chemical compounds, where vertices stand for atoms and are labeled by the atom type (represented by one-hot encoding), while edges between vertices represent bonds between the corresponding atoms. It includes 188 samples of chemical compounds with 7 discrete node labels.

249 PAPERS • 3 BENCHMARKS

DBLP

DBLP (Citation Network Dataset)

The DBLP is a citation network dataset. The citation data is extracted from DBLP, ACM, MAG (Microsoft Academic Graph), and other sources. The first version contains 629,814 papers and 632,752 citations. Each paper is associated with abstract, authors, year, venue, and title. The data set can be used for clustering with network and side information, studying influence in the citation network, finding the most influential papers, topic modeling analysis, etc.

209 PAPERS • 5 BENCHMARKS

MoleculeNet

MoleculeNet is a large scale benchmark for molecular machine learning. MoleculeNet curates multiple public datasets, establishes metrics for evaluation, and offers high quality open-source implementations of multiple previously proposed molecular featurization and learning algorithms (released as part of the DeepChem open source library). MoleculeNet benchmarks demonstrate that learnable representations are powerful tools for molecular machine learning and broadly offer the best performance.

186 PAPERS • 1 BENCHMARK

Wiki Squirrel (Wikipedia Squirrel)

The data was collected from the English Wikipedia (December 2018). These datasets represent page-page networks on specific topics (chameleons, crocodiles and squirrels). Nodes represent articles and edges are mutual links between them. The edges csv files contain the edges - nodes are indexed from 0. The features json files contain the features of articles - each key is a page id, and node features are given as lists. The presence of a feature in the feature list means that an informative noun appeared in the text of the Wikipedia article. The target csv contains the node identifiers and the average monthly traffic between October 2017 and November 2018 for each page. For each page-page network we listed the number of nodes an edges with some other descriptive statistics.

166 PAPERS • 2 BENCHMARKS

CLUSTER

CLUSTER is a node classification tasks generated with Stochastic Block Models, which is widely used to model communities in social networks by modulating the intra- and extra-communities connections, thereby controlling the difficulty of the task. CLUSTER aims at identifying community clusters in a semi-supervised setting.

143 PAPERS • 1 BENCHMARK

PATTERN

PATTERN is a node classification tasks generated with Stochastic Block Models, which is widely used to model communities in social networks by modulating the intra- and extra-communities connections, thereby controlling the difficulty of the task. PATTERN tests the fundamental graph task of recognizing specific predetermined subgraphs.

127 PAPERS • 1 BENCHMARK

WebKB

WebKB is a dataset that includes web pages from computer science departments of various universities. 4,518 web pages are categorized into 6 imbalanced categories (Student, Faculty, Staff, Department, Course, Project). Additionally there is Other miscellanea category that is not comparable to the rest.

96 PAPERS • 6 BENCHMARKS

Wiki-CS

Wiki-CS is a Wikipedia-based dataset for benchmarking Graph Neural Networks. The dataset is constructed from Wikipedia categories, specifically 10 classes corresponding to branches of computer science, with very high connectivity. The node features are derived from the text of the corresponding articles. They were calculated as the average of pretrained GloVe word embeddings (Pennington et al., 2014), resulting in 300-dimensional node features.

74 PAPERS • 2 BENCHMARKS

AMiner

The AMiner Dataset is a collection of different relational datasets. It consists of a set of relational networks such as citation networks, academic social networks or topic-paper-autor networks among others.

73 PAPERS • 1 BENCHMARK

Friendster

Friendster is an on-line gaming network. Before re-launching as a game website, Friendster was a social networking site where users can form friendship edge each other. Friendster social network also allows users form a group which other members can then join. The Friendster dataset consist of ground-truth communities (based on user-defined groups) and the social network from induced subgraph of the nodes that either belong to at least one community or are connected to other nodes that belong to at least one community.

61 PAPERS • NO BENCHMARKS YET

Long Range Graph Benchmark (LRGB)

The Long Range Graph Benchmark (LRGB) is a collection of 5 graph learning datasets that arguably require long-range reasoning to achieve strong performance in a given task. The 5 datasets in this benchmark can be used to prototype new models that can capture long range dependencies in graphs.

51 PAPERS • 5 BENCHMARKS

Penn94

Node classification on Penn94

48 PAPERS • 2 BENCHMARKS

genius

node classification on genius

34 PAPERS • 2 BENCHMARKS

BioGRID

BioGRID (Biological General Repository for Interaction Datasets)

BioGRID is a biomedical interaction repository with data compiled through comprehensive curation efforts. The current index is version 4.2.192 and searches 75,868 publications for 1,997,840 protein and genetic interactions, 29,093 chemical interactions and 959,750 post translational modifications from major model organism species.

33 PAPERS • 2 BENCHMARKS

OGB-LSC (OGB Large-Scale Challenge)

OGB Large-Scale Challenge (OGB-LSC) is a collection of three real-world datasets for advancing the state-of-the-art in large-scale graph ML. OGB-LSC provides graph datasets that are orders of magnitude larger than existing ones and covers three core graph learning tasks -- link prediction, graph regression, and node classification.

32 PAPERS • 3 BENCHMARKS

questions

Questions is an interaction graph of users of a question-answering website based on data provided by Yandex Q.

22 PAPERS • 1 BENCHMARK

roman-empire

Roman-empire is a word dependency graph based on the Roman Empire article from the English Wikipedia.

22 PAPERS • 1 BENCHMARK

twitch-gamers

node classification on twitch-gamers

22 PAPERS • 2 BENCHMARKS

Chameleon (48%/32%/20% fixed splits)

Node classification on Chameleon with the fixed 48%/32%/20% splits provided by Geom-GCN.

18 PAPERS • 2 BENCHMARKS

Yeast

Yeast dataset consists of a protein-protein interaction network. Interaction detection methods have led to the discovery of thousands of interactions between proteins, and discerning relevance within large-scale data sets is important to present-day biology.

18 PAPERS • NO BENCHMARKS YET

Film (60%/20%/20% random splits)

Node classification on Film with 60%/20%/20% random splits for training/validation/test.

17 PAPERS • 1 BENCHMARK

PubMed (60%/20%/20% random splits)

Node classification on PubMed with 60%/20%/20% random splits for training/validation/test.

17 PAPERS • 1 BENCHMARK

Squirrel (48%/32%/20% fixed splits)

Node classification on Squirrel with the fixed 48%/32%/20% splits provided by Geom-GCN.

17 PAPERS • 2 BENCHMARKS

Squirrel (60%/20%/20% random splits)

Node classification on Squirrel with 60%/20%/20% random splits for training/validation/test.

17 PAPERS • 1 BENCHMARK

tolokers

Tolokers is a crowdsourcing platform workers network based on data provided by Toloka.

17 PAPERS • 1 BENCHMARK

Cornell (48%/32%/20% fixed splits)

Node classification on Cornell with the fixed 48%/32%/20% splits provided by Geom-GCN.

16 PAPERS • 2 BENCHMARKS

Cornell (60%/20%/20% random splits)

Node classification on Cornell with 60%/20%/20% random splits for training/validation/test.

16 PAPERS • 2 BENCHMARKS

minesweeper

minesweeper is a synthetic graph emulating the eponymous game.

16 PAPERS • 1 BENCHMARK

Citeseer (48%/32%/20% fixed splits)

Node classification on Citeseer with the fixed 48%/32%/20% splits provided by Geom-GCN.

15 PAPERS • 1 BENCHMARK

Cora (48%/32%/20% fixed splits)

Node classification on Cora with the fixed 48%/32%/20% splits provided by Geom-GCN.

15 PAPERS • 1 BENCHMARK

PubMed (48%/32%/20% fixed splits)

Node classification on PubMed with the fixed 48%/32%/20% splits provided by Geom-GCN.

15 PAPERS • 1 BENCHMARK

Wisconsin (48%/32%/20% fixed splits)

Node classification on Wisconsin with the fixed 48%/32%/20% splits provided by Geom-GCN.

15 PAPERS • 2 BENCHMARKS

Film(48%/32%/20% fixed splits)

Node classification on Film with the fixed 48%/32%/20% splits provided by Geom-GCN.

14 PAPERS • 2 BENCHMARKS

Texas (48%/32%/20% fixed splits)

Node classification on Texas with the fixed 48%/32%/20% splits provided by Geom-GCN.

14 PAPERS • 2 BENCHMARKS

amazon-ratings

amazon-ratings is a product co-purchasing network based on data from SNAP datasets

14 PAPERS • 1 BENCHMARK

Yelp-Fraud (Multi-relational Graph Dataset for Yelp Spam Review Detection)

Yelp-Fraud is a multi-relational graph dataset built upon the Yelp spam review dataset, which can be used in evaluating graph-based node classification, fraud detection, and anomaly detection models.

10 PAPERS • 2 BENCHMARKS

USA Air-Traffic

Leonardo Filipe Rodrigues Ribeiro, Pedro H. P. Saverese, and Daniel R. Figueiredo. struc2vec: Learning node representations from structural identity.

9 PAPERS • 1 BENCHMARK

Brazil Air-Traffic

8 PAPERS • 2 BENCHMARKS

Facebook Page-Page

This webgraph is a page-page graph of verified Facebook sites. Nodes represent official Facebook pages while the links are mutual likes between sites. Node features are extracted from the site descriptions that the page owners created to summarize the purpose of the site. This graph was collected through the Facebook Graph API in November 2017 and restricted to pages from 4 categories which are defined by Facebook. These categories are: politicians, governmental organizations, television shows and companies. The task related to this dataset is multi-class node classification for the 4 site categories.

7 PAPERS • NO BENCHMARKS YET

Amazon-Fraud (Multi-relational Graph Dataset for Amazon Fraudulent Account Detection)

Amazon-Fraud is a multi-relational graph dataset built upon the Amazon review dataset, which can be used in evaluating graph-based node classification, fraud detection, and anomaly detection models.

6 PAPERS • 2 BENCHMARKS

MuMiN

MuMiN is a misinformation graph dataset containing rich social media data (tweets, replies, users, images, articles, hashtags), spanning 21 million tweets belonging to 26 thousand Twitter threads, each of which have been semantically linked to 13 thousand fact-checked claims across dozens of topics, events and domains, in 41 different languages, spanning more than a decade.

4 PAPERS • 3 BENCHMARKS

Deezer User Networks

The data was collected from the music streaming service Deezer (November 2017). These datasets represent friendship networks of users from 3 European countries. Nodes represent the users and edges are the mutual friendships. We reindexed the nodes in order to achieve a certain level of anonimity. The csv files contain the edges -- nodes are indexed from 0. The json files contain the genre preferences of users -- each key is a user id, the genres loved are given as lists. Genre notations are consistent across users. In each dataset users could like 84 distinct genres. Liked genre lists were compiled based on the liked song lists. The countries included are Romania, Croatia and Hungary. For each dataset we listed the number of nodes an edges.

3 PAPERS • NO BENCHMARKS YET

Placenta

Placenta is a benchmark dataset for node classification in an underexplored domain: predicting microanatomical tissue structures from cell graphs in placenta histology whole slide images. Cell graphs are large (>1 million nodes per image), node features are varied (64-dimensions of 11 types of cells), class labels are imbalanced (9 classes ranging from 0.21% of the data to 40.0%), and cellular communities cluster into heterogeneously distributed tissues of widely varying sizes (from 11 nodes to 44,671 nodes for a single structure).

2 PAPERS • 1 BENCHMARK