The Flickr30k dataset contains 31,000 images collected from Flickr, each paired with 5 reference sentences provided by human annotators.
718 PAPERS • 9 BENCHMARKS
The Reddit dataset is a graph dataset from Reddit posts made in the month of September, 2014. The node label in this case is the community, or “subreddit”, that a post belongs to. 50 large communities have been sampled to build a post-to-post graph, connecting posts if the same user comments on both. In total this dataset contains 232,965 posts with an average degree of 492. The first 20 days are used for training and the remaining days for testing (with 30% used for validation). For features, off-the-shelf 300-dimensional GloVe CommonCrawl word vectors are used.
582 PAPERS • 14 BENCHMARKS
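The post-to-post construction described for the Reddit dataset (two posts are connected if the same user comments on both) can be sketched as follows. This is a minimal illustration on toy data, not the original preprocessing code; the function names and the `(user_id, post_id)` input format are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def build_post_graph(comments):
    """Build undirected post-to-post edges from (user_id, post_id) pairs.

    Two posts are linked when at least one user commented on both.
    The input format is an assumption made for this sketch.
    """
    posts_by_user = defaultdict(set)
    for user, post in comments:
        posts_by_user[user].add(post)

    edges = set()
    for posts in posts_by_user.values():
        # Sorting gives each undirected edge a single canonical orientation.
        for a, b in combinations(sorted(posts), 2):
            edges.add((a, b))
    return edges


def average_degree(edges, num_nodes):
    """Average degree of an undirected graph: 2 * |E| / |V|."""
    return 2 * len(edges) / num_nodes if num_nodes else 0.0


# Toy data: three users commenting on three posts.
comments = [
    ("u1", "p1"), ("u1", "p2"),
    ("u2", "p2"), ("u2", "p3"),
    ("u3", "p1"), ("u3", "p2"),
]
edges = build_post_graph(comments)
print(sorted(edges))
print(average_degree(edges, num_nodes=3))
```

Note that one very active user links every pair of posts they commented on, which is one way a graph built like this can reach a high average degree.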
NELL is a dataset built from the Web via an intelligent agent called Never-Ending Language Learner. This agent attempts to learn over time to read the web. NELL has accumulated over 50 million candidate beliefs by reading the web, and it is considering these at different levels of confidence. NELL has high confidence in 2,810,379 of these beliefs.
166 PAPERS • 4 BENCHMARKS
The data was collected from the English Wikipedia (December 2018). These datasets represent page-page networks on specific topics (chameleons, crocodiles and squirrels). Nodes represent articles and edges are mutual links between them. The edges csv files contain the edges - nodes are indexed from 0. The features json files contain the features of articles - each key is a page id, and node features are given as lists. The presence of a feature in the feature list means that an informative noun appeared in the text of the Wikipedia article. The target csv contains the node identifiers and the average monthly traffic between October 2017 and November 2018 for each page. For each page-page network we listed the number of nodes and edges with some other descriptive statistics.
154 PAPERS • 2 BENCHMARKS
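The three-file layout described for the Wikipedia page-page networks (an edges csv, a features json keyed by page id, and a target csv) could be read along these lines. This is a sketch using only the Python standard library and in-memory stand-ins for the files; the column headers (`id1,id2` and `id,target`) and the sample values are assumptions for illustration, so check them against the actual files.

```python
import csv
import io
import json


def load_edges(fh):
    """Read an edge list; assumes one header row, then two integer columns."""
    reader = csv.reader(fh)
    next(reader)  # skip header
    return [(int(row[0]), int(row[1])) for row in reader]


def load_features(fh):
    """Read features: each JSON key is a page id, each value a list of feature ids."""
    raw = json.load(fh)
    return {int(page_id): set(feats) for page_id, feats in raw.items()}


def load_targets(fh):
    """Read targets; assumes one header row, then (node id, average monthly traffic)."""
    reader = csv.reader(fh)
    next(reader)  # skip header
    return {int(row[0]): float(row[1]) for row in reader}


# In-memory stand-ins for the real files (values are made up).
edges_csv = io.StringIO("id1,id2\n0,1\n1,2\n")
features_json = io.StringIO('{"0": [3, 7], "1": [7], "2": []}')
target_csv = io.StringIO("id,target\n0,171\n1,8089\n2,42\n")

edges = load_edges(edges_csv)
features = load_features(features_json)
targets = load_targets(target_csv)
print(edges, features, targets)
```

Storing each page's features as a set keeps the "presence of a feature" semantics explicit: a feature id is either in the set for a page or it is not.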
Node classification on Penn94
45 PAPERS • 2 BENCHMARKS
Node classification on genius
35 PAPERS • 2 BENCHMARKS
Node classification on twitch-gamers
23 PAPERS • 2 BENCHMARKS
Node classification on Chameleon with the fixed 48%/32%/20% splits provided by Geom-GCN.
18 PAPERS • 2 BENCHMARKS
Node classification on Squirrel with the fixed 48%/32%/20% splits provided by Geom-GCN.
17 PAPERS • 2 BENCHMARKS
Node classification on Cornell with the fixed 48%/32%/20% splits provided by Geom-GCN.
16 PAPERS • 2 BENCHMARKS
Node classification on Cornell with 60%/20%/20% random splits for training/validation/test.
Node classification on Film with 60%/20%/20% random splits for training/validation/test.
16 PAPERS • 1 BENCHMARK
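The 60%/20%/20% random splits mentioned in the entries above can be produced with a simple shuffled index partition. The sketch below is one plausible way to do it, not the procedure used by any particular benchmark; the function name, the seed handling, and the node count of 1000 are assumptions for the example.

```python
import random


def random_split(num_nodes, train_frac=0.6, val_frac=0.2, seed=0):
    """Randomly partition node indices into train/validation/test lists.

    The test set receives whatever remains after train and validation,
    so the three parts always cover every node exactly once.
    """
    indices = list(range(num_nodes))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility

    n_train = int(train_frac * num_nodes)
    n_val = int(val_frac * num_nodes)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


train, val, test = random_split(1000)
print(len(train), len(val), len(test))
```

Passing a different `seed` yields a different split, which is convenient when results are reported over several random splits.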
Node classification on Squirrel with 60%/20%/20% random splits for training/validation/test.
Node classification on Citeseer with the fixed 48%/32%/20% splits provided by Geom-GCN.
15 PAPERS • 1 BENCHMARK
Node classification on Cora with the fixed 48%/32%/20% splits provided by Geom-GCN.
Node classification on PubMed with the fixed 48%/32%/20% splits provided by Geom-GCN.
Node classification on PubMed with 60%/20%/20% random splits for training/validation/test.
Node classification on Wisconsin with the fixed 48%/32%/20% splits provided by Geom-GCN.
15 PAPERS • 2 BENCHMARKS
Node classification on Film with the fixed 48%/32%/20% splits provided by Geom-GCN.
14 PAPERS • 2 BENCHMARKS
Node classification on Texas with the fixed 48%/32%/20% splits provided by Geom-GCN.
Yelp-Fraud is a multi-relational graph dataset built upon the Yelp spam review dataset, which can be used in evaluating graph-based node classification, fraud detection, and anomaly detection models.
10 PAPERS • 2 BENCHMARKS
Leonardo Filipe Rodrigues Ribeiro, Pedro H. P. Saverese, and Daniel R. Figueiredo. struc2vec: Learning node representations from structural identity.
9 PAPERS • 1 BENCHMARK
Brazil Air-Traffic
8 PAPERS • 2 BENCHMARKS
This webgraph is a page-page graph of verified Facebook sites. Nodes represent official Facebook pages while the links are mutual likes between sites. Node features are extracted from the site descriptions that the page owners created to summarize the purpose of the site. This graph was collected through the Facebook Graph API in November 2017 and restricted to pages from 4 categories which are defined by Facebook. These categories are: politicians, governmental organizations, television shows and companies. The task related to this dataset is multi-class node classification for the 4 site categories.
7 PAPERS • NO BENCHMARKS YET
7 PAPERS • 3 BENCHMARKS
Amazon-Fraud is a multi-relational graph dataset built upon the Amazon review dataset, which can be used in evaluating graph-based node classification, fraud detection, and anomaly detection models.
6 PAPERS • 2 BENCHMARKS
MuMiN is a misinformation graph dataset containing rich social media data (tweets, replies, users, images, articles, hashtags), spanning 21 million tweets belonging to 26 thousand Twitter threads, each of which has been semantically linked to 13 thousand fact-checked claims across dozens of topics, events and domains, in 41 different languages, spanning more than a decade.
4 PAPERS • 3 BENCHMARKS
The data was collected from the music streaming service Deezer (November 2017). These datasets represent friendship networks of users from 3 European countries. Nodes represent the users and edges are the mutual friendships. We reindexed the nodes in order to achieve a certain level of anonymity. The csv files contain the edges -- nodes are indexed from 0. The json files contain the genre preferences of users -- each key is a user id, the genres loved are given as lists. Genre notations are consistent across users. In each dataset users could like 84 distinct genres. Liked genre lists were compiled based on the liked song lists. The countries included are Romania, Croatia and Hungary. For each dataset we listed the number of nodes and edges.
3 PAPERS • NO BENCHMARKS YET
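For the Deezer genre files described above (each JSON key is a user id, each value a list of liked genres), a common preprocessing step is to turn the lists into a multi-hot label matrix. The sketch below shows one way this might be done with the standard library; the genre names, the helper name, and the choice to build the vocabulary from the data itself are assumptions for illustration.

```python
import json


def genres_to_multi_hot(genres_by_user):
    """Convert {user_id: [genre, ...]} into a multi-hot matrix.

    Returns (user_ids, vocab, matrix), where matrix[i][j] is 1 if
    user_ids[i] likes vocab[j] and 0 otherwise.
    """
    vocab = sorted({g for genres in genres_by_user.values() for g in genres})
    column = {genre: j for j, genre in enumerate(vocab)}

    # JSON keys are strings, so order users numerically rather than lexically.
    user_ids = sorted(genres_by_user, key=int)
    matrix = []
    for uid in user_ids:
        row = [0] * len(vocab)
        for genre in genres_by_user[uid]:
            row[column[genre]] = 1
        matrix.append(row)
    return user_ids, vocab, matrix


# Toy stand-in for one of the json files (genre names are made up).
raw = json.loads('{"0": ["Pop", "Rock"], "1": ["Jazz"], "2": ["Pop"]}')
user_ids, vocab, matrix = genres_to_multi_hot(raw)
print(vocab)
print(matrix)
```

Because genre notations are consistent across users, a single shared vocabulary is enough; on the real data it would contain at most the 84 genres mentioned in the description.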
Classifying all cells in an organ is a relevant and difficult problem from plant developmental biology. We here abstract the problem into a new benchmark for node classification in a geo-referenced graph. Solving it requires learning the spatial layout of the organ including symmetries. To allow the convenient testing of new geometrical learning methods, the benchmark of Arabidopsis thaliana ovules is made available as a PyTorch data loader, along with a large number of precomputed features.
1 PAPER • 1 BENCHMARK
FDCompCN is a new fraud detection dataset for detecting financial statement fraud of companies in China. We construct a multi-relation graph based on the supplier, customer, shareholder, and financial information disclosed in the financial statements of Chinese companies. These data are obtained from the China Stock Market and Accounting Research (CSMAR) database. We select samples between 2020 and 2023, including 5,317 publicly listed Chinese companies traded on the Shanghai, Shenzhen, and Beijing Stock Exchanges.
The dataset contains constructed multi-modal features (visual and textual), pseudo-labels (on heritage values and attributes), and graph structures (with temporal, social, and spatial links) built from User-Generated Content data collected from the Flickr social media platform in three global cities containing UNESCO World Heritage properties (Amsterdam, Suzhou, Venice). The motivation for the data collection in this project is to provide datasets that are both directly applicable as a test-bed for ML communities and theoretically informative for heritage and urban scholars drawing conclusions for planning decision-making.
1 PAPER • NO BENCHMARKS YET
The analysis of building models for usable area, building safety, and energy efficiency requires accurate classification data of spaces and space elements. To reduce input model preparation effort and errors, automated classification of spaces and space elements is desirable. Although existing space function classifiers use space adjacency or connectivity graphs as input, the application of Graph Deep Learning (GDL) to space layout element classification has not been extensively researched due to the lack of suitable datasets. To bridge this gap, we introduce a dataset named SAGC-A68, which comprises access graphs automatically generated from 68 digital 3D models of space layouts of apartment buildings designed or built between 1952 and 2019 in 13 countries. Each access graph contains nodes representing spaces and space elements and edges representing the connection between them. Nodes are uniquely identified and characterized by 16 features including “Position X”, “Position Y”, “Posit