GitTables

Introduced by Hulsebos et al. in GitTables: A Large-Scale Corpus of Relational Tables

GitTables is a corpus that currently contains 1M relational tables extracted from CSV files on GitHub, covering 96 topics. Table columns in GitTables have been annotated with more than 2K different semantic types from Schema.org and DBpedia. Each column annotation consists of a semantic type, its hierarchical relations, its range types, the table domain, and a description.
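To make the annotation components listed above concrete, here is a minimal sketch of how one column annotation could be modelled in Python. The class, its field names, and the example values are hypothetical illustrations; they are not the corpus's actual metadata schema and are not taken from any GitTables table.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ColumnAnnotation:
    """Hypothetical container for one column annotation (field names are assumptions)."""
    column_name: str
    method: str                  # annotation method: "syntactic" or "semantic"
    ontology: str                # source ontology: "schema.org" or "dbpedia"
    semantic_type: str           # the semantic type assigned to the column
    parent_types: List[str] = field(default_factory=list)  # hierarchical relations
    range_types: List[str] = field(default_factory=list)   # range types of the semantic type
    domain: Optional[str] = None                            # table domain
    description: Optional[str] = None                       # textual description of the type


def group_by_method_and_ontology(
    annotations: List[ColumnAnnotation],
) -> Dict[Tuple[str, str], List[ColumnAnnotation]]:
    """Group annotations into (method, ontology) sets, e.g. ("semantic", "dbpedia")."""
    groups: Dict[Tuple[str, str], List[ColumnAnnotation]] = defaultdict(list)
    for ann in annotations:
        groups[(ann.method, ann.ontology)].append(ann)
    return dict(groups)


# Illustrative placeholder values only, not real corpus content.
example = ColumnAnnotation(
    column_name="addr",
    method="semantic",
    ontology="schema.org",
    semantic_type="address",
    range_types=["PostalAddress", "Text"],
    description="placeholder description",
)

groups = group_by_method_and_ontology([example])
```

Grouping by method and ontology mirrors the fact that each table can carry up to four sets of annotations: syntactic or semantic, each against Schema.org or DBpedia.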

The tables were annotated using two methods, one syntactic and one semantic. The resulting annotations are referred to in the table metadata as syntactic and semantic annotations. The syntactic method annotated 888,678 tables with Schema.org semantic types and 875,630 with DBpedia semantic types, while the semantic method annotated 1,161,117 tables with Schema.org and 1,156,601 with DBpedia semantic types.

Some statistics about the tables are provided in the table below. "Columns" refers to the number of annotated columns and "Classes" to the number of unique DBpedia or Schema.org semantic types used for annotation.

Annotation set          Columns       Classes
Syntactic-DBpedia       3,441,251     834
Syntactic-Schema.org    2,671,588     677
Semantic-DBpedia        10,757,184    2,380
Semantic-Schema.org     10,475,155    2,407
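As a rough way to read these figures, the sketch below divides each "Columns" count by the number of annotated tables quoted earlier for the same ontology. The pairing assumes that the smaller table counts (888,678 and 875,630) belong to the syntactic method and the larger ones (1,161,117 and 1,156,601) to the semantic method; this pairing and the variable names are assumptions, and the resulting averages are derived here rather than reported by the source.

```python
# Annotated column counts, copied from the statistics table.
annotated_columns = {
    "Syntactic-DBpedia": 3_441_251,
    "Syntactic-Schema.org": 2_671_588,
    "Semantic-DBpedia": 10_757_184,
    "Semantic-Schema.org": 10_475_155,
}

# Annotated table counts, copied from the text.
# Assumption: the smaller counts belong to the syntactic method.
annotated_tables = {
    "Syntactic-DBpedia": 875_630,
    "Syntactic-Schema.org": 888_678,
    "Semantic-DBpedia": 1_156_601,
    "Semantic-Schema.org": 1_161_117,
}

# Average number of annotated columns per annotated table.
avg_columns_per_table = {
    name: annotated_columns[name] / annotated_tables[name]
    for name in annotated_columns
}

for name, avg in avg_columns_per_table.items():
    print(f"{name}: {avg:.2f} annotated columns per annotated table")
```

Under this assumed pairing, the semantic method annotates roughly three times as many columns per table as the syntactic method.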
