Drawing CoCo Core-Sets from Incomplete Relational Data.
APWeb/WAIM (1)(2019)
Abstract
Incompleteness is a pervasive issue that makes it challenging to answer queries with high-quality tuples. Since not all missing values can be repaired with complete values, it is crucial to report the completeness of a query answer to support further decisions. To estimate such completeness quickly and objectively, this paper proposes CoCo core-sets. A CoCo core-set is a subset of an incomplete relational dataset whose tuples provide enough complete values on the attributes of interest and whose ratio of complete values is close to that of the entire dataset. Based on CoCo core-sets, reliable mechanisms can be designed to estimate query completeness on incomplete datasets. This paper investigates the problem of drawing CoCo core-sets from incomplete relational data. To the best of our knowledge, no such proposal has been made before. (1) We formalize the problem of drawing CoCo core-sets and prove that it is NP-complete. (2) We propose an efficient algorithm that draws an approximate CoCo core-set, using uniform sampling to select tuples for coverage and completeness. (3) Analysis of the proposed algorithm shows that both the coverage of the attributes of interest and the relative error between the ratio of complete attribute values in the drawn tuples and that in the entire dataset can be kept within a given relative error bound. (4) Experiments on both real-world and synthetic datasets demonstrate that the algorithm effectively and efficiently draws tuples that preserve the properties of the entire dataset for query completeness estimation, and that it scales well.
Keywords
Data quality, Data completeness, Query completeness, Incomplete data, CoCo core-sets
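The two properties the abstract asks of a CoCo core-set (enough complete values on the attributes of interest, and a ratio of complete values close to that of the whole dataset) can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the function names, the per-attribute threshold `k`, the relative error bound `eps`, the retry loop, and the use of `None` for a missing value are all assumptions made for illustration.

```python
import random


def complete_ratio(rows, attrs):
    """Fraction of non-missing (non-None) values over the given attributes."""
    total = len(rows) * len(attrs)
    if total == 0:
        return 0.0
    filled = sum(1 for r in rows for a in attrs if r.get(a) is not None)
    return filled / total


def covers(rows, attrs, k):
    """True if every attribute has at least k complete values in rows (assumed coverage test)."""
    return all(
        sum(1 for r in rows if r.get(a) is not None) >= k
        for a in attrs
    )


def draw_sample(rows, attrs, size, k, eps, seed=0, max_tries=100):
    """Draw uniform samples without replacement until one passes both checks.

    Returns the sample, or None if no draw passes within max_tries.
    """
    rng = random.Random(seed)
    target = complete_ratio(rows, attrs)
    for _ in range(max_tries):
        sample = rng.sample(rows, size)
        ratio = complete_ratio(sample, attrs)
        # Relative error of the sample's completeness ratio against the full data.
        close = abs(ratio - target) <= eps * target
        if close and covers(sample, attrs, k):
            return sample
    return None
```

For example, on a list of dictionaries in which half of the values on attributes `a` and `b` are `None`, `draw_sample(rows, ["a", "b"], size=40, k=5, eps=0.2)` returns 40 tuples whose completeness ratio lies within 20% of the full dataset's ratio, or `None` if no such draw is found.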