Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Blocking	Abt-Buy	Sparkly k=10	Recall	98.1	#3
Blocking	Abt-Buy	Sparkly k=10	Candidate Set Size	10900	#4
Blocking	Abt-Buy	Sparkly k=50	Recall	99.2	#2
Blocking	Abt-Buy	Sparkly k=50	Candidate Set Size	54500	#6
Blocking	Amazon-Google	Sparkly k=10	Recall	96.8	#6
Blocking	Amazon-Google	Sparkly k=10	Candidate Set Size	33300	#2
Blocking	Amazon-Google	Sparkly k=50	Recall	99.2	#2
Blocking	Amazon-Google	Sparkly k=50	Candidate Set Size	165900	#6
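
The page does not define the two metrics in the table. Assuming the usual blocking definitions, recall is the percentage of true matching pairs that survive blocking, and candidate set size is the number of record pairs the blocker emits. The sketch below computes both on a hypothetical toy example; the function name `blocking_recall` and all record IDs are made up for illustration and do not come from the Sparkly code base.

```python
def blocking_recall(candidates, gold_matches):
    """Percentage of gold matching pairs that appear in the candidate set."""
    if not gold_matches:
        raise ValueError("gold_matches must not be empty")
    return 100.0 * len(candidates & gold_matches) / len(gold_matches)


# Hypothetical toy data: pairs are (id in table A, id in table B).
gold = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")}
candidates = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a9", "b4"), ("a7", "b2")}

recall = blocking_recall(candidates, gold)  # 3 of the 4 gold pairs are kept
candidate_set_size = len(candidates)        # pairs passed on to the matcher
```

The two numbers pull in opposite directions: a larger candidate set usually raises recall but leaves more pairs for the downstream matcher. In the table above, moving from k=10 to k=50 grows the candidate set roughly fivefold on both datasets.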

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/sparkly-a-simple-yet-surprisingly-strong-tf/blocking-on-amazon-google)](https://paperswithcode.com/sota/blocking-on-amazon-google?p=sparkly-a-simple-yet-surprisingly-strong-tf)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/sparkly-a-simple-yet-surprisingly-strong-tf/blocking-on-abt-buy)](https://paperswithcode.com/sota/blocking-on-abt-buy?p=sparkly-a-simple-yet-surprisingly-strong-tf)`

Sparkly: A Simple yet Surprisingly Strong TF/IDF Blocker for Entity Matching

Proceedings of the VLDB Endowment 2023 · Derek Paulsen, Yash Govind, AnHai Doan

Blocking is a major task in entity matching. Numerous blocking solutions have been developed, but as far as we can tell, blocking using the well-known tf/idf measure has received virtually no attention. Yet, when we experimented with tf/idf blocking using Lucene, we found it did quite well. So in this paper we examine tf/idf blocking in depth. We develop Sparkly, which uses Lucene to perform top-k tf/idf blocking in a distributed share-nothing fashion on a Spark cluster. We develop techniques to identify good attributes and tokenizers that can be used to block on, making Sparkly completely automatic. We perform extensive experiments showing that Sparkly outperforms 8 state-of-the-art blockers. Finally, we provide an in-depth analysis of Sparkly's performance, regarding both recall/output size and runtime. Our findings suggest that (a) tf/idf blocking needs more attention, (b) Sparkly forms a strong baseline that future blocking work should compare against, and (c) future blocking work should seriously consider top-k blocking, which helps improve recall, and a distributed share-nothing architecture, which helps improve scalability, predictability, and extensibility.
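
To make the idea of top-k tf/idf blocking concrete, here is a minimal single-machine sketch in plain Python. It is not the Sparkly implementation: Sparkly uses Lucene on a Spark cluster, whereas this sketch assumes whitespace tokenization, a simple `log(1 + N/df)` idf, and document-length-normalized tf/idf scores. The names `tokenize`, `build_index`, `top_k_block` and the toy product records are hypothetical.

```python
import heapq
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokens; Sparkly selects tokenizers automatically.
    return text.lower().split()


def build_index(table_a):
    """Build an inverted index with tf/idf weights over the records of table A."""
    doc_tokens = {rid: Counter(tokenize(text)) for rid, text in table_a.items()}
    df = Counter()
    for counts in doc_tokens.values():
        df.update(counts.keys())
    n = len(doc_tokens)
    idf = {tok: math.log(1 + n / d) for tok, d in df.items()}
    postings = defaultdict(list)
    norms = {}
    for rid, counts in doc_tokens.items():
        weights = {tok: tf * idf[tok] for tok, tf in counts.items()}
        norms[rid] = math.sqrt(sum(w * w for w in weights.values()))
        for tok, w in weights.items():
            postings[tok].append((rid, w))
    return postings, idf, norms


def top_k_block(table_a, table_b, k):
    """For each record in table B, keep the k highest-scoring records of table A."""
    postings, idf, norms = build_index(table_a)
    candidates = set()
    for b_id, text in table_b.items():
        scores = defaultdict(float)
        for tok, tf in Counter(tokenize(text)).items():
            if tok not in idf:
                continue  # token never occurs in table A
            q_w = tf * idf[tok]
            for a_id, d_w in postings[tok]:
                # The query norm is constant per query, so it is omitted.
                scores[a_id] += q_w * d_w / norms[a_id]
        best = heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
        candidates.update((a_id, b_id) for a_id, _ in best)
    return candidates


# Hypothetical toy tables in the spirit of product-matching data.
table_a = {
    "a1": "sony bravia 40 inch lcd tv",
    "a2": "canon powershot digital camera",
    "a3": "apple ipod nano 8gb silver",
    "a4": "sony cybershot digital camera",
}
table_b = {
    "b1": "sony 40 lcd television bravia",
    "b2": "canon digital camera powershot sd1000",
    "b3": "ipod nano silver 8gb apple",
}

pairs_k1 = top_k_block(table_a, table_b, k=1)
pairs_k2 = top_k_block(table_a, table_b, k=2)
```

Because each query record contributes at most k pairs, the output size is bounded by k times the number of query records, which is one reason top-k blocking gives predictable candidate set sizes. Each query is scored independently of the others, so the loop over `table_b` is the part that a share-nothing setup could partition across workers.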

Code

anhaidgroup/sparkly

Tasks

Blocking

Datasets

Amazon-Google Abt-Buy

Results from the Paper

Ranked #2 on Blocking on Amazon-Google
