1 code implementation • Proceedings of the VLDB Endowment 2023 • Derek Paulsen, Yash Govind, AnHai Doan
We develop Sparkly, which uses Lucene to perform top-k tf/idf blocking in a distributed shared-nothing fashion on a Spark cluster.
Ranked #2 on Blocking on Amazon-Google
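To make the idea of top-k tf/idf blocking concrete, here is a minimal single-process sketch in plain Python. It is not Sparkly's code and does not use Lucene or Spark: the tokenizer, the smoothed idf formula, and the function names `build_index` and `top_k_candidates` are all assumptions chosen for illustration. For each query record it returns the k indexed records with the highest tf/idf similarity as candidate matches.

```python
import heapq
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; a real analyzer would do more.
    return text.lower().split()


def build_index(records):
    """Build an inverted index over `records` (dict: record id -> text).

    Returns (index, idf), where index maps token -> list of
    (record id, length-normalized tf/idf weight).
    """
    n = len(records)
    term_freqs = {rid: Counter(tokenize(text)) for rid, text in records.items()}

    # Document frequency: in how many records each token occurs.
    doc_freq = Counter()
    for tf in term_freqs.values():
        doc_freq.update(tf.keys())

    # Assumption: a smoothed idf; this is not Lucene's scoring formula.
    idf = {tok: math.log(1 + n / d) for tok, d in doc_freq.items()}

    index = defaultdict(list)
    for rid, tf in term_freqs.items():
        norm = math.sqrt(sum((count * idf[tok]) ** 2 for tok, count in tf.items()))
        for tok, count in tf.items():
            index[tok].append((rid, count * idf[tok] / norm))
    return index, idf


def top_k_candidates(query_text, index, idf, k):
    """Return up to k (record id, score) pairs, highest score first."""
    scores = defaultdict(float)
    for tok, count in Counter(tokenize(query_text)).items():
        if tok not in index:
            continue  # token never seen in the indexed table
        query_weight = count * idf[tok]
        for rid, doc_weight in index[tok]:
            scores[rid] += query_weight * doc_weight
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])
```

In a blocking setting one table would be indexed once and every record of the other table issued as a query; the union of the returned pairs is the candidate set passed on to the matcher. A distributed version would partition the query records across workers, each holding a copy of the index.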
1 code implementation • Proceedings of the VLDB Endowment 2021 • Saravanan Thirumuruganathan, Han Li, Nan Tang, Mourad Ouzzani, Yash Govind, Derek Paulsen, Glenn Fung, AnHai Doan
In this paper, we develop the DeepBlocker framework that significantly advances the state of the art in applying DL to blocking for EM.
Ranked #5 on Blocking on Abt-Buy
1 code implementation • 1 Apr 2020 • Yuliang Li, Jinfeng Li, Yoshihiko Suhara, AnHai Doan, Wang-Chiew Tan
Our experiments show that a straightforward application of language models such as BERT, DistilBERT, or RoBERTa, pre-trained on large text corpora, already significantly improves matching quality and outperforms the previous state of the art (SOTA) by up to 29% in F1 score on benchmark datasets.
Ranked #2 on Entity Resolution on WDC Watches-xlarge
1 code implementation • SIGMOD: International Conference on Management of Data 2018 • Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, Vijay Raghavendra
Entity matching (EM) finds data instances that refer to the same real-world entity.
Ranked #8 on Entity Resolution on Amazon-Google
no code implementations • 29 Sep 2017 • AnHai Doan, Adel Ardalan, Jeffrey R. Ballard, Sanjib Das, Yash Govind, Pradap Konda, Han Li, Erik Paulson, Paul Suganthan G. C., Haojun Zhang
They provide tools to address the "pain points" of the steps, and the tools are built on top of the Python data science and Big Data ecosystem (PyData).
Databases