A Hierarchical Orthographic Similarity Measure for Interconnected Texts Represented by Graphs

APPLIED SCIENCES-BASEL(2024)

引用 0|浏览0
暂无评分
摘要
Similarity measures play a pivotal role in automatic techniques designed to analyse large volumes of textual data. Conventional approaches, treating texts as paradigmatic examples of unstructured data, tend to overlook their structural nuances, leading to a loss of valuable information. In this paper, we propose a novel orthographic similarity measure tailored for the semi-structured analysis of texts. We explore a graph-based representation for texts, where the graph's structure is shaped by a hierarchical decomposition of textual discourse units. Employing the concept of edit distances, our orthographic similarity measure is computed hierarchically across all components in this textual graph, integrating precomputed similarity values among lower-level nodes. The relevance and applicability of the presented approach are illustrated by a real-world example, featuring texts that exhibit intricate interconnections among their components. The resulting similarity scores, between all different structural levels of the graph, allow for a deeper understanding of the (structural) interconnections among texts and enhances the explainability of similarity measures as well as the tools using them.
更多
查看译文
关键词
text similarity,syntactic similarity,orthographic similarity,text analysis,graph databases
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要