A Trie Based Set Similarity Query Algorithm

Lianyin Jia, Junzhuo Tang, Mengjuan Li, Runxin Li, Jiaman Ding, Yinong Chen

Mathematics (2023)


Abstract

Set similarity query is a primitive for many applications, such as data integration, data cleaning, and gene sequence alignment. Most existing algorithms are based on inverted indexes; they usually filter unqualified sets one by one and lack sufficient support for duplicated sets, which leads to low efficiency. To solve this problem, this paper designs T-starTrie, an efficient trie-based index for set similarity query that naturally groups sets with the same prefix into one node and can filter all sets corresponding to that node at once, thereby significantly improving the efficiency of candidate generation. We find that the set similarity query problem can be transformed into the problem of detecting matching nodes of the first layer (FMNodes) on T-starTrie, and we therefore design an efficient FMNode detection algorithm. Based on this, an efficient set similarity query algorithm, TT-SSQ, is implemented by developing a variety of filtering techniques. Experimental results show that TT-SSQ can be up to 3.10x faster than existing algorithms.
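
The idea of grouping sets that share a prefix into one trie node, and discarding a whole node at once, can be illustrated with a minimal sketch. This is not the paper's T-starTrie or TT-SSQ: it is a plain prefix trie over sorted tokens, it assumes Jaccard similarity (the abstract does not name the similarity function), and the names `TrieNode`, `PrefixTrie`, `insert`, and `query` are hypothetical. Duplicated sets end at the same node, so they are verified once and reported together.

```python
from bisect import bisect_right


class TrieNode:
    __slots__ = ("children", "set_ids")

    def __init__(self):
        self.children = {}   # token -> TrieNode
        self.set_ids = []    # ids of all sets that end exactly at this node


class PrefixTrie:
    """Illustrative prefix trie over sorted token sets (not T-starTrie)."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, set_id, tokens):
        # Tokens are deduplicated and sorted so equal prefixes share a path.
        node = self.root
        for tok in sorted(set(tokens)):
            node = node.children.setdefault(tok, TrieNode())
        node.set_ids.append(set_id)  # duplicated sets land on the same node

    def query(self, tokens, threshold):
        """Return ids of sets whose Jaccard similarity to `tokens` >= threshold."""
        q = sorted(set(tokens))
        q_set = set(q)
        results = []
        stack = [(self.root, 0, 0)]  # (node, path length, overlap with query)
        while stack:
            node, depth, overlap = stack.pop()
            if node.set_ids and depth > 0:
                jaccard = overlap / (depth + len(q) - overlap)
                if jaccard >= threshold:
                    results.extend(node.set_ids)
            for tok, child in node.children.items():
                d = depth + 1
                o = overlap + (1 if tok in q_set else 0)
                # Query tokens larger than `tok` may still match deeper down.
                remaining = len(q) - bisect_right(q, tok)
                misses = d - o  # path tokens that are not in the query
                # Upper bound on Jaccard for every set in this subtree:
                # overlap <= o + remaining, union >= |Q| + misses.
                if (o + remaining) / (len(q) + misses) >= threshold:
                    stack.append((child, d, o))
                # Otherwise the whole subtree is filtered in one step.
        return sorted(results)


trie = PrefixTrie()
trie.insert(0, [1, 2, 3])
trie.insert(1, [3, 2, 1])   # duplicate of set 0
trie.insert(2, [1, 2, 4])
trie.insert(3, [5, 6])
print(trie.query([1, 2, 3], 0.5))  # [0, 1, 2]
```

In this toy run the subtree under token `5` is rejected by a single bound check, without visiting set 3 individually, which is the kind of per-node filtering the abstract describes.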


Keywords

set similarity query, T-starTrie, FMNodes, TT-SSQ
