Short Text Document Clustering using Distributed Word Representation and Document Distance

Walailak Journal of Science and Technology (WJST)(2018)

Abstract
This paper presents a method for clustering short text documents, such as instant messages, SMS, or news headlines. The vocabulary of each text is expanded using external knowledge sources and represented by a Distributed Word Representation. Clustering is done using the K-means algorithm with Word Mover's Distance as the distance metric. Experiments compared the clustering quality of this method with that of several leading methods, using large datasets from BBC headlines, SearchSnippets, StackExchange, and Twitter. For all datasets, the proposed algorithm produced document clusters with higher accuracy, precision, F1-score, and Adjusted Rand Index. We also observe that a description of each cluster can be inferred from the keywords represented in it.
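
As a rough illustration of clustering with a word-embedding document distance, the Python sketch below groups a few toy headlines. It is not the paper's implementation, and it rests on several assumptions: the 2-D embeddings in EMB are invented for the example, relaxed_wmd computes the relaxed lower bound of Word Mover's Distance rather than the exact optimal-transport value, and the K-means centroid update is swapped for a medoid update (k-medoids), because a medoid needs only pairwise distances. All function names are hypothetical, and the vocabulary expansion step is not shown.

```python
import math
from collections import Counter

# Toy 2-D "embeddings", invented for illustration only (assumption).
EMB = {
    "goal": (1.0, 0.0), "match": (0.9, 0.1), "striker": (0.8, 0.2),
    "stock": (0.0, 1.0), "market": (0.1, 0.9), "shares": (0.2, 0.8),
}

def nbow(tokens):
    """Normalized bag-of-words weights over in-vocabulary tokens.

    Assumes the document has at least one token present in EMB.
    """
    counts = Counter(t for t in tokens if t in EMB)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def relaxed_wmd(a, b):
    """Relaxed WMD: each word sends all its weight to the nearest word
    of the other document; the larger of the two directions is returned."""
    def one_sided(src, dst):
        return sum(wt * min(math.dist(EMB[w], EMB[v]) for v in dst)
                   for w, wt in src.items())
    return max(one_sided(a, b), one_sided(b, a))

def k_medoids(docs, medoids, iters=10):
    """Alternate assignment and medoid update from given medoid indices."""
    labels = []
    for _ in range(iters):
        # Assignment step: nearest medoid under the document distance.
        labels = [min(range(len(medoids)),
                      key=lambda k: relaxed_wmd(d, docs[medoids[k]]))
                  for d in docs]
        # Update step: the member with the smallest total distance to
        # the other members becomes the new medoid of its cluster.
        new = []
        for k in range(len(medoids)):
            members = [i for i, lab in enumerate(labels) if lab == k]
            if not members:          # keep the old medoid for an empty cluster
                new.append(medoids[k])
                continue
            new.append(min(members, key=lambda i: sum(
                relaxed_wmd(docs[i], docs[j]) for j in members)))
        if new == medoids:
            break
        medoids = new
    return labels, medoids

texts = ["goal match", "striker goal", "stock market",
         "shares market stock", "match striker", "shares stock"]
docs = [nbow(t.split()) for t in texts]
labels, medoids = k_medoids(docs, medoids=[0, 2])
for text, label in zip(texts, labels):
    print(label, text)
```

With real data, the toy EMB table would be replaced by pretrained word vectors, and the relaxed distance by an exact Word Mover's Distance solver if the cost is acceptable.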
Keywords
Distributed word representation, document distance, short text documents, short text documents clustering