Power-law based estimation of set similarity join size

PVLDB(2009)

Abstract
We propose a novel technique for estimating the size of a set similarity join (SSJoin). The technique relies on a succinct representation of sets using Min-Hash signatures. We exploit frequent patterns in the signatures for SSJoin size estimation by counting their support. Because the counts of signature patterns overlap, we need the set Inclusion-Exclusion (IE) principle to combine them, and we develop a novel lattice-based counting method that evaluates the IE principle efficiently. The proposed counting technique is linear in the lattice size. To keep the mining process lightweight, we exploit a recently discovered power-law relationship between pattern count and frequency. Extensive experimental evaluations show that the proposed technique produces accurate estimates efficiently.
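As background for the Min-Hash representation mentioned in the abstract, the following is a minimal, generic sketch of Min-Hash signatures and the similarity estimate they support. It is not the paper's estimator: the function names (`make_hash_params`, `minhash_signature`, `estimate_jaccard`), the linear hash family modulo a prime, and the choice of Jaccard similarity are all assumptions made for illustration. Sets are assumed to contain non-negative integers smaller than the modulus.

```python
import random

# Assumption: a Mersenne prime (2^31 - 1) as the modulus of the hash family.
PRIME = 2_147_483_647


def make_hash_params(k, seed=0):
    """Draw k (a, b) pairs defining hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]


def minhash_signature(items, params):
    """Return the Min-Hash signature: the minimum hash value per hash function."""
    return [min((a * x + b) % PRIME for x in items) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature positions."""
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    params = make_hash_params(k=128, seed=42)
    r = set(range(0, 100))
    s = set(range(50, 150))
    print("exact:   ", exact_jaccard(r, s))
    print("estimate:", estimate_jaccard(minhash_signature(r, params),
                                        minhash_signature(s, params)))
```

Two sets agree at a signature position with probability roughly equal to their Jaccard similarity, so groups of sets sharing values at the same positions (signature patterns) indicate similar pairs. How the paper counts such patterns and corrects for overlaps is described only at the level of the abstract here and is not reproduced by this sketch.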
Keywords
min-hash signature, novel technique, set similarity, efficient estimation, size estimation, set inclusion-exclusion, proposed technique, ie principle, lattice size, power-law relationship, power law