On Memory and I/O Efficient Duplication Detection for Multiple Self-clean Data Sources

DASFAA'10: Proceedings of the 15th International Conference on Database Systems for Advanced Applications (2010)

Abstract

In this paper, we propose efficient algorithms for duplicate detection across multiple data sources that are themselves duplicate-free. In developing these algorithms, we give full consideration to the possible cases arising from the workload of the data sources to be cleaned and the available memory. The algorithms are memory and I/O efficient: they reduce the number of pairwise record comparisons and minimize the total page access cost of the cleaning process. Experimental evaluation shows that the proposed algorithms are efficient and outperform SNM and random access methods.
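To make the setting concrete, here is a minimal sketch of why duplicate-free sources reduce pairwise comparisons. It is not the paper's algorithm: it is a plain sorted neighborhood (SNM) pass over the merged sources that skips any pair drawn from the same source, since such a pair cannot be a duplicate. The function name `snm_cross_source`, the use of plain strings as records, and the `is_match` predicate are all assumptions made for illustration, and the sketch holds everything in memory, so it does not model page access cost.

```python
from typing import Callable, List, Sequence, Tuple


def snm_cross_source(
    sources: Sequence[Sequence[str]],
    window: int,
    is_match: Callable[[str, str], bool],
) -> Tuple[List[Tuple[str, str]], int]:
    """Sorted-neighborhood pass that only compares records from different sources.

    Hypothetical helper: each source is assumed to be duplicate-free already,
    so same-source pairs are skipped without calling is_match.
    """
    # Tag every record with its source index and sort the merged list by key.
    tagged = sorted(
        (key, src) for src, records in enumerate(sources) for key in records
    )

    matches: List[Tuple[str, str]] = []
    comparisons = 0
    for i, (key_i, src_i) in enumerate(tagged):
        # Compare with the next (window - 1) records in sorted order.
        for key_j, src_j in tagged[i + 1 : i + window]:
            if src_i == src_j:
                continue  # same self-clean source: cannot be a duplicate
            comparisons += 1
            if is_match(key_i, key_j):
                matches.append((key_i, key_j))
    return matches, comparisons


matches, comparisons = snm_cross_source(
    [["alice", "bob"], ["alicia", "bob", "carol"]],
    window=2,
    is_match=lambda a, b: a == b,
)
print(matches, comparisons)
```

With a window of 2, the adjacent pair from the second source is never compared, which is the kind of saving the abstract refers to. The memory- and I/O-aware scheduling of pages is beyond this sketch.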

Keywords

Memory Constraint, Sorting Order, Page Access, Pairwise Comparison Method, Record Comparison
