MusicBrainz20K

The MusicBrainz20K dataset for entity resolution and entity clustering is based on real song records from the MusicBrainz database. Each record has five attributes: artist, title, album, year, and length. The records were modified with the DAPO [1] data generator. The generated dataset consists of five sources and approximately 20K records describing 10K unique song entities. For 50% of the original records it contains duplicates spread across two to five sources. These duplicates are generated with a high degree of corruption to stress-test entity resolution and clustering approaches.
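To make the record layout concrete, here is a minimal Python sketch of how such records might be represented and how the ground-truth duplicate structure could be inspected. Only the five attributes (artist, title, album, year, length) come from the description above. The fields `record_id`, `source_id`, and `cluster_id`, the helper names, and the toy values are assumptions for illustration and may not match the actual file layout.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class SongRecord:
    record_id: int   # hypothetical unique id of the record
    source_id: int   # hypothetical id of the source (1 to 5)
    cluster_id: int  # hypothetical ground-truth entity id
    artist: str
    title: str
    album: str
    year: str        # kept as text, since corrupted values may not parse
    length: str      # kept as text for the same reason


def cluster_size_histogram(records):
    """Map cluster size -> number of ground-truth entities of that size."""
    sizes = Counter(r.cluster_id for r in records)
    return Counter(sizes.values())


def true_match_pairs(records):
    """Return all record-id pairs that refer to the same entity."""
    by_cluster = defaultdict(list)
    for r in records:
        by_cluster[r.cluster_id].append(r.record_id)
    pairs = set()
    for ids in by_cluster.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Made-up toy records, not taken from the dataset.
toy_records = [
    SongRecord(1, 1, 100, "Artist A", "Song One", "Album X", "1999", "215"),
    SongRecord(2, 3, 100, "Artst A", "Song 0ne", "Album X", "99", "3:35"),
    SongRecord(3, 2, 200, "Artist B", "Song Two", "Album Y", "2005", "187"),
]

histogram = cluster_size_histogram(toy_records)
pairs = true_match_pairs(toy_records)
```

A pair set like the one returned by `true_match_pairs` is a common basis for computing pairwise precision and recall of a resolution approach, while the histogram shows how many entities appear once and how many are duplicated across sources.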

[1] Hildebrandt, Kai, et al. "Large-scale data pollution with Apache Spark." IEEE Transactions on Big Data 6.2 (2017): 396-411.
