Optimization of scientific publications clustering with ensemble approach for topic extraction

Scientometrics (2023)

Abstract
The continually growing Internet generates a considerable amount of text data. Extracting general topics or themes from a massive corpus of documents is difficult because such a large volume of text is unstructured. Text document clustering (TDC) is a technique for grouping texts based on their content similarity. Partitioning a text collection according to the significance of the documents’ content is one of the most challenging tasks in TDC. This study proposes the Bare-Bones Based Salp Swarm Algorithm (BBSSA) to solve the TDC problem. In addition, an ensemble approach for automatic topic extraction (TE) is proposed to extract the topics from the resulting clusters. The proposed BBSSA and the ensemble TE approach are tested on six standard benchmarks and six scientific publication datasets from top QS-ranked UAE universities. BBSSA’s results are compared with those of sixteen well-known techniques: eleven metaheuristic algorithms, namely the Whale Optimization Algorithm (WOA), Firefly Algorithm (FFA), Bat Algorithm (BAT), Harmony Search (HS), Particle Swarm Optimization (PSO), Genetic Algorithm (GA), Multi-Verse Optimizer (MVO), Grey Wolf Optimizer (GWO), Moth-Flame Optimization (MFO), Krill Herd Algorithm (KHA), and the Salp Swarm Algorithm (SSA); and five clustering methods, namely K-means++, K-means, Density-based Spatial Clustering of Applications with Noise (DBSCAN), Spectral, and Agglomerative clustering. The results of the ensemble TE approach are compared with those of seven well-known statistical methods: Mutual Information (MI), TextRank (TR), Co-Occurrence Statistical Information-based Keyword Extraction (CSI), Term Frequency-Inverse Document Frequency (TF-IDF), most-frequent-term keyword extraction (TF), YAKE!, and RAKE. In the experiments, BBSSA outperforms all of the compared approaches. The results also show that, for most datasets, the proposed ensemble TE strategy outperforms all of the compared TE methods on external metrics. Thus, the ensemble TE approach can be seen as a supplement to the other methods.
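To make the notion of "content similarity" concrete, the following is a minimal sketch of one of the baselines named in the abstract, TF-IDF weighting, combined with cosine similarity between documents. It is not the BBSSA or the ensemble TE method, whose details the abstract does not give. The function names (`tfidf_vectors`, `cosine`, `top_terms`), the whitespace tokenization, and the smoothed IDF formula are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document.

    Assumption: lowercase whitespace tokenization and a smoothed IDF,
    log((1 + N) / (1 + df)) + 1. Real pipelines usually add stop-word
    removal and stemming.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_terms(vector, k=3):
    """Return the k highest-weighted terms, ties broken alphabetically."""
    ranked = sorted(vector.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


if __name__ == "__main__":
    # Hypothetical toy corpus, not data from the study.
    docs = [
        "salp swarm optimization algorithm",
        "swarm optimization metaheuristic algorithm",
        "topic extraction keyword ranking",
    ]
    vecs = tfidf_vectors(docs)
    print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
    print(top_terms(vecs[0], k=1))
```

A clustering method, whether K-means or a metaheuristic such as the one proposed, would then search for a partition that places documents with high pairwise similarity in the same cluster, and a keyword extractor such as `top_terms` could be applied to each cluster to label it.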
Keywords
Scientific publications clustering,Topic extraction,Ensemble method,Salp swarm algorithm,Bare Bones