k-means clustering and kNN classification based on negative databases

APPLIED SOFT COMPUTING(2021)

引用 30|浏览25
暂无评分
摘要
Nowadays, privacy protection has become an important issue in data mining. k-means clustering and kNN classification are two popular data mining algorithms, which have been widely studied in the past decade. In this paper, we mainly study the problem of privacy protection during k-means clustering and kNN classification. Negative database (NDB) is a new type of data representation which can protect privacy as well as support distance estimation, so it is promising to apply NDBs to privacy-preserving k-means clustering and kNN classification. Existing privacy-preserving k-means clustering and kNN classification algorithms based on NDBs could effectively protect data privacy, but their clustering or classification performance has a non-negligible degradation. To alleviate this problem, we propose a new NDB generation algorithm (named QK-hidden algorithm). Compared to existing NDB generation algorithms, the QK-hidden algorithm employs a new set of parameters to control the selection of different bits when generating NDB records, and this enables a fine-grained control of the accuracy of distance estimation. By the QK-hidden algorithm, private data are converted into NDBs before inputting to k-means and kNN algorithms. Then, we propose an approach specialized for estimating Euclidean distance from the NDBs generated by the QK-hidden algorithm, and this approach is used instead of calculating Euclidean distance on plain text. Moreover, we design a model for estimating the cluster centers from NDBs for the k-means algorithm. The remaining steps of k-means and kNN algorithms keep the same to original versions. Experimental results demonstrate that the proposed algorithms could achieve more than 30% improvement on the clustering performance (in terms of average Davies- Bouldin Index) and 9% improvement on the classification performance (in terms of macro precision, recall and F1 score) in comparison with some existing works. (C) 2021 Elsevier B.V. All rights reserved.
更多
查看译文
关键词
Privacy protection, Negative database, kNN classification, Euclidean distance
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要