Sparkly: A Simple yet Surprisingly Strong TF/IDF Blocker for Entity Matching
Blocking is a major task in entity matching. Numerous blocking solutions have been developed, but as far as we can tell, blocking using the well-known tf/idf measure has received virtually no attention. Yet, when we experimented with tf/idf blocking using Lucene, we found it did quite well. So in this paper we examine tf/idf blocking in depth. We develop Sparkly, which uses Lucene to perform top-k tf/idf blocking in a distributed share-nothing fashion on a Spark cluster. We develop techniques to identify good attributes and tokenizers that can be used to block on, making Sparkly completely automatic. We perform extensive experiments showing that Sparkly outperforms 8 state-of-the-art blockers. Finally, we provide an in-depth analysis of Sparkly's performance, regarding both recall/output size and runtime. Our findings suggest that (a) tf/idf blocking needs more attention, (b) Sparkly forms a strong baseline that future blocking work should compare against, and (c) future blocking work should seriously consider top-k blocking, which helps improve recall, and a distributed share-nothing architecture, which helps improve scalability, predictability, and extensibility.
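The top-k tf/idf blocking idea described in the abstract can be illustrated with a minimal, single-machine sketch in plain Python. This is not the Sparkly implementation (which uses Lucene on a Spark cluster). The whitespace tokenizer, the particular tf/idf weighting, and the names `tokenize`, `build_index`, `top_k_block`, and `recall` are all assumptions made for illustration. The sketch indexes one table, scores each record of the other table against it, and keeps the k highest-scoring candidates per query record. The `recall` helper shows how the Recall metric reported below could be computed from a set of known matches.

```python
import heapq
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; Sparkly selects tokenizers automatically.
    return text.lower().split()


def build_index(records):
    """Build an inverted index (token -> {record_id: term frequency}) and idf weights."""
    postings = defaultdict(dict)
    for rid, text in records.items():
        for tok, tf in Counter(tokenize(text)).items():
            postings[tok][rid] = tf
    n = len(records)
    # Assumption: a smoothed idf variant; Lucene's actual scoring formula differs.
    idf = {tok: math.log(1 + n / len(docs)) for tok, docs in postings.items()}
    return postings, idf


def top_k_block(indexed, queries, k):
    """Return candidate pairs (indexed_id, query_id): the top-k indexed records per query."""
    postings, idf = build_index(indexed)
    candidates = set()
    for qid, text in queries.items():
        scores = defaultdict(float)
        for tok, qtf in Counter(tokenize(text)).items():
            if tok not in postings:
                continue
            weight = idf[tok] ** 2
            for rid, tf in postings[tok].items():
                scores[rid] += qtf * tf * weight
        # Only records sharing at least one token are scored, so a query
        # may contribute fewer than k candidates.
        best = heapq.nlargest(k, scores.items(), key=lambda kv: (kv[1], kv[0]))
        candidates.update((rid, qid) for rid, _ in best)
    return candidates


def recall(candidates, gold_matches):
    """Fraction of true matching pairs that survive blocking."""
    return len(candidates & gold_matches) / len(gold_matches)


# Hypothetical toy data, not drawn from the benchmark datasets.
table_a = {
    "a1": "sony bravia 40 inch lcd tv",
    "a2": "canon powershot digital camera",
    "a3": "apple ipod nano 8gb silver",
}
table_b = {
    "b1": "sony 40 lcd television bravia",
    "b2": "canon digital camera powershot sd1000",
}
pairs = top_k_block(table_a, table_b, k=1)
gold = {("a1", "b1"), ("a2", "b2")}
print(sorted(pairs), recall(pairs, gold))
```

In this sketch the candidate set contains at most k pairs per query record, so its size grows roughly linearly with k, which is the recall versus output size trade-off the results table reflects.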
Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
---|---|---|---|---|---|
Blocking | Abt-Buy | Sparkly k=10 | Recall | 98.1 | # 3 |
Blocking | Abt-Buy | Sparkly k=10 | Candidate Set Size | 10900 | # 4 |
Blocking | Abt-Buy | Sparkly k=50 | Recall | 99.2 | # 2 |
Blocking | Abt-Buy | Sparkly k=50 | Candidate Set Size | 54500 | # 6 |
Blocking | Amazon-Google | Sparkly k=10 | Recall | 96.8 | # 6 |
Blocking | Amazon-Google | Sparkly k=10 | Candidate Set Size | 33300 | # 2 |
Blocking | Amazon-Google | Sparkly k=50 | Recall | 99.2 | # 2 |
Blocking | Amazon-Google | Sparkly k=50 | Candidate Set Size | 165900 | # 6 |