SMGKM: an Efficient Incremental Algorithm for Clustering Document Collections

Adil M. Bagirov, Sattar Seifollahi, Massimo Piccardi, Ehsan Zare Borzeshi, Bernie Kruger

Lecture Notes in Computer Science (2023)

Abstract

Given a large unlabeled document collection, the aim of this paper is to develop an accurate and efficient algorithm for clustering it. Document collections typically contain tens or hundreds of thousands of documents, with thousands or tens of thousands of features (i.e., distinct words). Most existing clustering algorithms struggle to find accurate solutions on data sets of this size. The proposed algorithm overcomes this difficulty with an incremental approach: the number of clusters is increased progressively, from an initial value of one up to a set value. At each iteration, the new candidate cluster is initialized using a partitioning approach which is guaranteed to minimize the objective function. Experiments were carried out on six diverse datasets with different evaluation criteria, and the proposed algorithm outperformed comparable state-of-the-art clustering algorithms in all cases.
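
The incremental idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' SMGKM implementation: it assumes a squared-Euclidean k-means objective, and it swaps in a simple global-k-means-style heuristic for choosing each new center, in place of the paper's partitioning-based initialization. The function names `lloyd` and `incremental_kmeans` are hypothetical.

```python
import numpy as np

def lloyd(X, centers, n_iter=50):
    """Plain Lloyd (k-means) iterations starting from the given centers."""
    centers = centers.copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.argmin(axis=1), d2.min(axis=1).sum()

def incremental_kmeans(X, k_max):
    """Grow the number of clusters from 1 to k_max, one center at a time.

    Assumption: the new center is the data point that would reduce the
    objective the most if added (a stand-in heuristic, not the paper's
    initialization procedure).
    """
    X = np.asarray(X, dtype=float)
    centers = X.mean(axis=0, keepdims=True)       # k = 1: the global centroid
    labels = np.zeros(len(X), dtype=int)
    history = [((X - centers[0]) ** 2).sum()]     # objective value for each k
    # Pairwise squared distances between data points (O(n^2) memory).
    sq = (X ** 2).sum(axis=1)
    pair = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    for _ in range(2, k_max + 1):
        # Distance of each point to its nearest current center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # gain[c] = total objective decrease if point c became a new center.
        gain = np.maximum(d2[None, :] - pair, 0.0).sum(axis=1)
        candidate = X[gain.argmax()]
        centers, labels, obj = lloyd(X, np.vstack([centers, candidate]))
        history.append(obj)
    return centers, labels, history
```

Because each step starts from the previous solution plus one extra center, the objective in `history` cannot increase as clusters are added. The pairwise distance matrix makes this sketch quadratic in the number of documents, so it is suitable only for small illustrative inputs, not for collections of the size the abstract describes.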

Key words

Document Clustering, Clustering Algorithms, Semi-supervised Clustering, Density-based Clustering, Document Categorization
