Pylon Benchmark (Pylon Table Union Search Benchmark)

Introduced by Cong et al. in Pylon: Semantic Table Union Search in Data Lakes

We create a new dataset from GitTables, a data lake of 1.7M tables extracted from CSV files on GitHub. The benchmark comprises 1,746 tables including union-able table subsets under topics selected from Schema.org: scholarly article, job posting, and music playlist. We end up with these three topics since we can find a fair number of union-able tables of them from diverse sources in the corpus (we can easily find union-able tables from a single source but they are less interesting for table union search as simple syntactic methods can identify all of them because of the same schema and consistent value representations).

Papers


Paper Code Results Date Stars

Dataset Loaders


No data loaders found. You can submit your data loader here.

Tasks


License


  • CC BY

Modalities


Languages