Nonparametric Bayesian Models for Unsupervised Learning (2011)

Abstract
Unsupervised learning is an important topic in machine learning. In particular, clustering is an unsupervised learning problem that arises in a variety of applications for data analysis and mining. Unfortunately, clustering is an ill-posed and therefore challenging problem: no ground truth is available to validate clustering results. Two issues arise as a consequence. First, each clustering algorithm embeds its own bias, resulting from its optimization criterion; as a result, different algorithms may discover different patterns in the same dataset. Second, parameters must be set: in clustering, parameter settings control the characterization of individual clusters and the total number of clusters in the data.

Clustering ensembles have been proposed to address the issue of the different biases induced by various algorithms. Clustering ensembles combine different clustering results and can provide solutions that are robust against spurious elements in the data. Although clustering ensembles are a significant advance, they do not satisfactorily address model selection and parameter tuning.

Bayesian approaches have been applied to clustering to address the parameter-tuning and model-selection issues. Bayesian methods provide a principled way to address these problems by assuming prior distributions on model parameters. Prior distributions assign low probabilities to unlikely parameter values; they therefore serve as regularizers for model parameters and can help avoid over-fitting. In addition, Bayesian approaches use the marginal likelihood as the criterion for model selection. Even so, the key model-selection question "How many clusters?" remains open. Nonparametric Bayesian approaches, a special class of Bayesian methods, have been proposed to address this question. Unlike parametric Bayesian models, for which the number of parameters is finite and fixed, nonparametric Bayesian models allow the number of parameters to grow with the number of observations; after observing the data, a nonparametric Bayesian model fits the data with a finite-dimensional set of parameters.

An additional issue with clustering is high dimensionality. High-dimensional data pose a difficult challenge to the clustering process. A common scenario with high-dimensional data is that clusters may exist in different subspaces comprised of different combinations of features (dimensions); in other words, data points in a cluster may be similar to each other along a subset of dimensions, but not along all of them. Subspace clustering techniques, also known as co-clustering or bi-clustering, have been proposed to address the dimensionality issue (here, I use the term co-clustering). Like clustering, co-clustering suffers from its ill-posed nature and from the lack of ground truth to validate results.
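
The abstract describes clustering ensembles only at a high level. A minimal sketch of one common way to combine base clusterings, evidence accumulation via a co-association matrix, is shown below. The k-means base clusterings, the average-linkage consensus step, and all function names are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a clustering ensemble via a co-association
# (evidence accumulation) matrix. Not the paper's method; the choice of
# k-means base clusterings and average-linkage consensus is an assumption.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def co_association(labelings, n_points):
    """Fraction of base clusterings that put each pair of points in the same cluster."""
    C = np.zeros((n_points, n_points))
    for labels in labelings:
        C += (labels[:, None] == labels[None, :])
    return C / len(labelings)

X, _ = make_blobs(n_samples=200, centers=4, random_state=0)

# Base clusterings with different biases: vary the number of clusters and the seed.
labelings = [KMeans(n_clusters=k, n_init=10, random_state=s).fit_predict(X)
             for k, s in [(3, 0), (4, 1), (5, 2), (6, 3)]]

C = co_association(labelings, len(X))

# Consensus clustering: treat 1 - C as a distance and cut an
# average-linkage dendrogram into 4 clusters.
D = 1.0 - C
np.fill_diagonal(D, 0.0)                      # squareform needs a zero diagonal
Z = linkage(squareform(D, checks=False), method="average")
consensus_labels = fcluster(Z, t=4, criterion="maxclust")
```

Note that the consensus step above still fixes the number of clusters in advance; the nonparametric Bayesian models the abstract argues for are meant to remove exactly that choice.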
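The claim that nonparametric Bayesian models let the number of parameters grow with the number of observations is easiest to see in the Chinese Restaurant Process, the partition prior underlying Dirichlet process mixtures. The sketch below samples a partition from that prior; it illustrates the general idea rather than the specific models developed in the thesis, and the concentration value alpha = 2.0 is an arbitrary choice.

```python
# Minimal sketch of the Chinese Restaurant Process prior: the expected
# number of clusters ("tables") grows (roughly as alpha * log n) with the
# number of observations, so "how many clusters?" is answered by the data.
import numpy as np

def sample_crp(n_points, alpha, seed=None):
    """Sample a random partition of n_points items with concentration alpha."""
    rng = np.random.default_rng(seed)
    assignments = [0]          # the first customer opens the first table
    counts = [1]               # counts[k] = number of customers at table k
    for _ in range(1, n_points):
        n = sum(counts)
        # Join existing table k with probability counts[k] / (n + alpha),
        # open a new table with probability alpha / (n + alpha).
        probs = np.array(counts + [alpha], dtype=float) / (n + alpha)
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):
            counts.append(1)   # a new cluster is created
        else:
            counts[table] += 1
        assignments.append(table)
    return assignments, counts

assignments, counts = sample_crp(n_points=500, alpha=2.0, seed=0)
print(len(counts), "clusters for 500 points")
```

In a Dirichlet process mixture, each table would additionally carry cluster parameters drawn from a base distribution, and inference (for example, Gibbs sampling) reuses these same conditional probabilities.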
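For the high-dimensional setting, the abstract appeals to co-clustering: grouping rows (data points) and columns (features) simultaneously. As a stand-in for the nonparametric Bayesian co-clustering the thesis develops, the sketch below uses scikit-learn's spectral co-clustering on synthetic biclustered data; it only illustrates the row/column structure being exploited, not the paper's approach.

```python
# Illustrative co-clustering example (spectral co-clustering, not the
# thesis's nonparametric Bayesian formulation): rows and columns of the
# data matrix are partitioned jointly, so each cluster lives in a subspace.
import numpy as np
from sklearn.cluster import SpectralCoclustering
from sklearn.datasets import make_biclusters

# Synthetic matrix with 4 planted row/column blocks.
data, row_truth, col_truth = make_biclusters(
    shape=(300, 40), n_clusters=4, noise=5.0, random_state=0
)

model = SpectralCoclustering(n_clusters=4, random_state=0).fit(data)

# Each data point (row) gets a cluster, and so does each feature (column):
# a cluster is characterized by a subset of dimensions, as in subspace clustering.
print("row cluster sizes:", np.bincount(model.row_labels_))
print("column cluster sizes:", np.bincount(model.column_labels_))
```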
Keywords
clustering process, high-dimensional data, Bayesian approach, nonparametric Bayesian model, subspace clustering technique, unsupervised learning, clustering result, clustering ensemble, model selection, Bayesian method