A distributed storage MLCS algorithm with time efficient upper bound and precise lower bound

Information Sciences (2022)

Abstract
Finding the longest common subsequence of multiple sequences (the MLCS problem) is a key problem in data mining, with applications such as pattern recognition in bio-sequences (e.g., DNA and protein sequences), disease diagnosis, and deciphering human genetic codes. The length of the longest common subsequence is often used to measure the similarity between sequences. Current state-of-the-art algorithms for the MLCS problem typically transform it into finding the longest path in a directed acyclic graph (DAG); their main drawback is that either the running time or the constructed DAG is too large. To address this, the paper proposes a time-efficient upper bound estimation method that estimates the upper bound on the length of any path through a node in much less time, together with an adaptive lower bound estimation strategy that yields a more precise lower bound on the length of the longest common subsequence. The study also describes a distributed storage scheme that stores most of the DAG on external devices and keeps only a small part in memory. Experimental comparison with several state-of-the-art algorithms indicates that the proposed algorithm is more efficient and effective.
Keywords
Multiple longest common subsequences problem, Distributed storage, Biological data mining
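
The reduction mentioned in the abstract, from MLCS to a longest path in a DAG, can be illustrated with a minimal sketch. This is a generic textbook-style formulation, not the paper's algorithm: nodes are match points (one index per sequence, all holding the same character), edges go to the next match point for each character, and the longest path from a virtual source spells an MLCS. The function names (`build_successor_tables`, `mlcs`, `naive_upper_bound`) are hypothetical, and the bound shown is a deliberately crude one, assumed here only to show what "an upper bound on any path through a node" means. It keeps the whole DAG in memory and assumes at least one input sequence.

```python
from functools import lru_cache


def build_successor_tables(seqs, alphabet):
    """For each sequence, table[c][j] is the smallest k >= j with s[k] == c, else None."""
    tables = []
    for s in seqs:
        table = {c: [None] * (len(s) + 1) for c in alphabet}
        for j in range(len(s) - 1, -1, -1):
            for c in alphabet:
                table[c][j] = j if s[j] == c else table[c][j + 1]
        tables.append(table)
    return tables


def naive_upper_bound(seqs, point, depth):
    """Crude bound (an assumption of this sketch, not the paper's estimator):
    a path through `point` at level `depth` cannot be longer than `depth`
    plus the shortest remaining suffix among all sequences."""
    return depth + min(len(s) - p - 1 for s, p in zip(seqs, point))


def mlcs(seqs):
    """Return one longest common subsequence of all strings in `seqs`."""
    # Only characters present in every sequence can appear in a common subsequence.
    alphabet = sorted(set.intersection(*(set(s) for s in seqs)))
    tables = build_successor_tables(seqs, alphabet)

    def successors(point):
        # One candidate successor per character: its next occurrence in every sequence.
        for c in alphabet:
            nxt = tuple(t[c][p + 1] for t, p in zip(tables, point))
            if None not in nxt:
                yield c, nxt

    @lru_cache(maxsize=None)
    def longest_from(point):
        # Memoized longest path in the match-point DAG starting after `point`.
        best = ""
        for c, nxt in successors(point):
            candidate = c + longest_from(nxt)
            if len(candidate) > len(best):
                best = candidate
        return best

    source = tuple(-1 for _ in seqs)  # virtual source placed before every sequence
    return longest_from(source)


if __name__ == "__main__":
    dna = ["ACGTAC", "CGTTAC", "GCGTAC"]
    print(mlcs(dna))
    print(naive_upper_bound(dna, (-1, -1, -1), 0))
```

In a branch-and-bound setting, a node whose upper bound does not exceed a known lower bound (for example, the length of any common subsequence already found) can be discarded without expanding it, which is the role the two bounds play in the abstract's description.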