Finding the boundaries of compound documents on the web

William Y. Arms, Pavel Alexandrovich Dmitriev

Finding the boundaries of compound documents on the web (2008)


Abstract

In both the paper and digital worlds, information resources generally consist of individual parts combined into a whole. A book is an aggregation of physical pages, and a web document is often published as several HTML pages. While in the paper world the connection between the aggregated resource and its parts is often obvious, in the digital world this connection is less clear and may depend on the context. Given a set of information units, the problem is to decide how many aggregated resources there are, and which information units are parts of which resource. The ability to automatically identify aggregate resources is useful in many applications. On the web, example applications include web and intranet search, user navigation, automated collection generation, and information extraction. It is also useful in biological applications, digital libraries, and P2P systems. The focus of this dissertation is developing algorithms for automatically identifying such aggregate resources. While the main scope of the dissertation is the problem of identifying such resources, or compound documents (cDocs), on the web, the proposed techniques can be extended to other domains. The problem is difficult because the structure of cDocs typically varies and is not necessarily inherent in the web site, but may be imposed by a specific application or context. To address these complexities, this dissertation describes a machine-learning-based approach to recognizing cDocs. Given example cDocs provided by an agent, the approach first infers the agent's criteria for a cDoc, and then identifies other cDocs according to these criteria. Two specific approaches implementing the above framework are described. The first approach, Weighted Graph Clustering, learns the correct clustering of a web site based on analysis of the features of individual web pages and their immediate neighbors. The second approach, Generalized Pattern Matching, is based on graph mining techniques and analyzes the web site as a whole. The approaches are evaluated on real web sites from educational, commercial, and news domains. The strengths and weaknesses of each approach are discussed, and a combined hybrid approach is described. Experiments demonstrate significant improvements in performance over baseline heuristic approaches.
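
The abstract does not give the details of Weighted Graph Clustering, so the following is only a rough sketch of the general idea of clustering a site's link graph by edge weights, not the dissertation's algorithm. It assumes that each hyperlink receives a score in [0, 1] and that pages joined by links scoring at or above a threshold belong to the same cDoc. The names `cluster_pages` and `same_directory`, the threshold of 0.5, and the toy URLs are all hypothetical; in particular, `same_directory` is a hand-written heuristic standing in for a weight function that would be learned from example cDocs.

```python
from collections import defaultdict


def cluster_pages(pages, links, weight, threshold=0.5):
    """Group pages into candidate compound documents.

    pages:  iterable of page identifiers (here, URL paths)
    links:  iterable of (source, target) pairs; both ends must be in pages
    weight: function (source, target) -> score in [0, 1]; a hypothetical
            stand-in for a model learned from example cDocs
    """
    parent = {p: p for p in pages}

    def find(p):
        # Union-find lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    # Merge the endpoints of every link whose weight reaches the threshold.
    for a, b in links:
        if weight(a, b) >= threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for p in parent:
        groups[find(p)].add(p)
    return sorted(sorted(g) for g in groups.values())


def same_directory(a, b):
    # Toy heuristic (an assumption): pages in the same directory belong together.
    return 1.0 if a.rsplit("/", 1)[0] == b.rsplit("/", 1)[0] else 0.0


pages = [
    "/course/index.html",
    "/course/lecture1.html",
    "/course/lecture2.html",
    "/news/today.html",
]
links = [
    ("/course/index.html", "/course/lecture1.html"),
    ("/course/lecture1.html", "/course/lecture2.html"),
    ("/course/index.html", "/news/today.html"),
]
clusters = cluster_pages(pages, links, same_directory)
```

On this toy site the three course pages end up in one cluster and the news page in its own, because the only link between the two directories scores 0.0. A fixed directory rule like this is closer to a baseline heuristic than to the learned approach the abstract describes; it is used here only to keep the sketch self-contained.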

Keywords

compound document, aggregate resource, specific approach, combined hybrid approach, individual web page, real web site, web site, baseline heuristic approach, information unit, digital worlds information resource, web document
