How to reduce the search space of Entity Resolution: with Joins, Blocking or Nearest Neighbor search? [Experiment, Analysis & Benchmark Papers]


Entity Resolution is the task of identifying pairs of entity profiles that represent the same real-world object. To avoid checking a quadratic number of entity pairs, various filtering techniques have been proposed that fall into three main categories: (i) blocking workflows group together entity profiles with identical or similar signatures, (ii) string similarity joins identify entity pairs that exceed a user-defined similarity threshold, and (iii) nearest-neighbor methods convert all entity profiles into vectors and identify the closest ones to every query entity. Unfortunately, the main techniques from these different categories have rarely been compared in the literature and, thus, their relative performance is unknown. We perform the first systematic experimental study that investigates the relative performance of the main representatives per category over 10 real-world datasets. Comparing techniques from different categories turns out to be a non-trivial task due to the various configuration parameters that are hard to fine-tune, but have a significant impact on performance. We consider a plethora of parameter configurations, optimizing each technique with respect to recall and precision targets. Both schema-agnostic and schema-based settings are evaluated. The experimental results provide novel insights into the effectiveness and time efficiency of the considered techniques, which are condensed in a systematic overview.

PVLDB Reference Format: George Papadakis, Marco Fisichella, Franziska Schoger, George Mandilaras, Nikolaus Augsten, Wolfgang Nejdl. How to reduce the search space of Entity Resolution? With Joins, Blocking or Nearest Neighbor search? PVLDB, 14(1): XXX-XXX, 2020.
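As a rough illustration of the three filtering categories named in the abstract, the toy Python sketch below generates candidate pairs with (i) token blocking, (ii) a Jaccard similarity join, and (iii) a k-nearest-neighbor search over bag-of-words vectors. It is not the paper's implementation: the profiles, the function names, the choice of Jaccard and cosine similarity, the threshold of 0.7, and k = 1 are all assumptions made for the example, and the join and the neighbor search are written as brute-force loops for readability, whereas practical systems rely on indexes to avoid comparing every pair.

```python
from collections import defaultdict
from itertools import combinations
import math

# Hypothetical toy profiles (schema-agnostic: each profile is one text blob).
profiles = {
    1: "apple iphone 12 black",
    2: "iphone 12 apple black 64gb",
    3: "samsung galaxy s21",
    4: "galaxy s21 by samsung",
}

def tokens(text):
    """Lower-cased whitespace tokens of a profile, as a set."""
    return set(text.lower().split())

def token_blocking(profiles):
    """(i) Blocking: every token is a signature; profiles sharing a token
    land in the same block and every pair inside a block is a candidate."""
    blocks = defaultdict(set)
    for pid, text in profiles.items():
        for tok in tokens(text):
            blocks[tok].add(pid)
    candidates = set()
    for ids in blocks.values():
        candidates.update(combinations(sorted(ids), 2))
    return candidates

def jaccard_join(profiles, threshold):
    """(ii) Similarity join: keep pairs whose Jaccard similarity reaches the
    user-defined threshold (brute force here, for clarity only)."""
    result = set()
    for a, b in combinations(sorted(profiles), 2):
        ta, tb = tokens(profiles[a]), tokens(profiles[b])
        if len(ta & tb) / len(ta | tb) >= threshold:
            result.add((a, b))
    return result

def knn_candidates(profiles, k):
    """(iii) Nearest neighbors: turn profiles into binary bag-of-words
    vectors and pair each query profile with its k most similar profiles
    under cosine similarity."""
    vocab = sorted(set().union(*(tokens(t) for t in profiles.values())))
    vectors = {
        pid: [1.0 if w in tokens(text) else 0.0 for w in vocab]
        for pid, text in profiles.items()
    }

    def cosine(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
        return dot / norm if norm else 0.0

    result = set()
    for q in vectors:
        others = [p for p in vectors if p != q]
        ranked = sorted(others, key=lambda p: cosine(vectors[q], vectors[p]), reverse=True)
        for p in ranked[:k]:
            result.add(tuple(sorted((q, p))))
    return result

blocking_pairs = token_blocking(profiles)
join_pairs = jaccard_join(profiles, threshold=0.7)
knn_pairs = knn_candidates(profiles, k=1)
```

On this toy input all three filters keep the same two candidate pairs out of the six possible ones. On real data they diverge, and the outcome depends heavily on the configuration (signature type, threshold, k), which is the tuning difficulty the abstract points to.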