MetricJoin: Leveraging Metric Properties for Robust Exact Set Similarity Joins.

ICDE (2023)

Abstract
Given two collections of sets, the set similarity join reports all pairs of sets that are within a given distance threshold. State-of-the-art solutions employ an inverted list index and several heuristics to compute the join result efficiently. Prefix-based solutions benefit from infrequent set elements, known as tokens, and spend considerable time scanning long lists if the token frequency is not sufficiently skewed. Partition-based methods are less sensitive to the token distribution but suffer from a significantly larger memory footprint, limiting their applicability as the threshold or the set sizes grow. Solutions from the domain of metric-based similarity search are designed to reduce the overall number of distance computations. Generic metric techniques cannot compete with state-of-the-art similarity joins tailored to sets, which in turn do not exploit metric filter opportunities.

We propose MetricJoin, the first exact set similarity join technique that leverages the metric properties of set distance functions. In contrast to its competitors, MetricJoin is robust, i.e., datasets with different characteristics can be joined efficiently in terms of runtime and memory. Our algorithm embeds sets in vector space, organizes long inverted lists in spatial indexes, and employs an effective metric filter to prune unqualified sets. MetricJoin requires only linear space in the collection size and substantially reduces the number of sets that must be considered. In our performance studies, MetricJoin outperforms state-of-the-art solutions by up to an order of magnitude in runtime and generates up to five orders of magnitude fewer candidates.
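
To illustrate what a metric filter can look like in general, the following is a minimal sketch of pivot-based pruning with the triangle inequality. It is not the MetricJoin algorithm: the abstract does not specify its embedding, spatial index, or filter, so the choice of Jaccard distance, the single pivot, and the names `jaccard_distance` and `pivot_filtered_join` are all assumptions made for illustration. The idea is that for any metric d and pivot p, |d(r, p) - d(s, p)| <= d(r, s), so a pair whose pivot distances differ by more than the threshold cannot qualify and need not be verified.

```python
from itertools import product


def jaccard_distance(a, b):
    """Jaccard distance 1 - |a & b| / |a | b|, a metric on finite sets."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def pivot_filtered_join(R, S, threshold, pivot):
    """Nested-loop join with a triangle-inequality filter (illustrative only).

    Returns the qualifying index pairs and the number of pairs that
    survived the filter and had to be verified with an exact distance.
    """
    # Distances to the (hypothetical) pivot are computed once per set.
    r_piv = [jaccard_distance(r, pivot) for r in R]
    s_piv = [jaccard_distance(s, pivot) for s in S]

    result = []
    verified = 0
    for (i, r), (j, s) in product(enumerate(R), enumerate(S)):
        # Lower bound on d(r, s) from the triangle inequality.
        if abs(r_piv[i] - s_piv[j]) > threshold:
            continue  # pruned without computing d(r, s)
        verified += 1
        if jaccard_distance(r, s) <= threshold:
            result.append((i, j))
    return result, verified


R = [frozenset({1, 2, 3}), frozenset({7, 8, 9})]
S = [frozenset({1, 2, 3, 4}), frozenset({8, 9, 10})]
pairs, verified = pivot_filtered_join(R, S, threshold=0.5, pivot=frozenset({1, 2, 3}))
```

In this toy input, two of the four pairs are pruned by the filter and the remaining two are verified. The filter never changes the join result; it only reduces how many exact distance computations are needed, and how much it prunes depends heavily on the pivot.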
Keywords
collection size, distance computations, effective metric filter, generic metric techniques, given distance threshold, infrequent set elements, inverted list index, join result, larger memory footprint, long lists, metric filter opportunities, metric properties, metric-based similarity search, MetricJoin, partition-based methods, prefix-based solutions, reports all pairs, robust exact set similarity joins, set distance functions, set sizes, state-of-the-art similarity, token distribution, token frequency, unqualified sets