Text clustering with LLM embeddings
CoRR (2024)
Abstract
Text clustering is an important approach for organising the growing amount of
digital content, helping to structure and find hidden patterns in uncategorised
data. In this research, we investigated how different textual embeddings -
particularly those used in large language models (LLMs) - and clustering
algorithms affect how text datasets are clustered. A series of experiments was
conducted to assess how embeddings influence clustering results, the role
played by dimensionality reduction through summarisation, and embedding size
adjustment. Results reveal that LLM embeddings excel at capturing the nuances
of structured language, while BERT leads the lightweight options in
performance. In addition, we find that increasing embedding dimensionality and
applying summarisation techniques do not uniformly improve clustering
efficiency, suggesting that these strategies require careful analysis before
use in real-world applications. These results highlight a complex balance
between the need for nuanced
text representation and computational feasibility in text clustering
applications. This study extends traditional text clustering frameworks by
incorporating embeddings from LLMs, thereby paving the way for improved
methodologies and opening new avenues for future research in various types of
textual analysis.
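The pipeline the abstract describes (embed each document, cluster the vectors, score the result) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `TfidfVectorizer` stands in for an LLM embedding model, which would be far heavier to run, and the toy documents, the choice of k-means, and the silhouette metric are all assumptions made for the example.

```python
# Minimal embed-then-cluster sketch. TfidfVectorizer is a lightweight
# stand-in for an LLM embedding encoder (an assumption of this example;
# the paper evaluates actual LLM and BERT embeddings).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

docs = [
    "the stock market fell sharply today",
    "the stock market rallied after the report",
    "the team won the football match",
    "the football team lost the match",
]

# 1. Embed each document as a fixed-size vector.
X = TfidfVectorizer().fit_transform(docs)

# 2. Cluster the vectors; the number of clusters is chosen by the analyst.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# 3. Evaluate cluster quality with an internal metric such as silhouette.
score = silhouette_score(X, labels)
print(labels, round(score, 3))
```

Swapping the vectoriser for a stronger embedding model changes only step 1; the clustering and evaluation steps are unchanged, which is what makes the embedding choice the key experimental variable in studies like this one.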