Web genre classification with methods for structured output prediction.

Gjorgji Madjarov, Vedrana Vidulin, Ivica Dimitrovski, Dragi Kocev

Information Sciences (2019)

Abstract

The growing number of web pages calls for improvements to search engines. One such improvement is letting the user specify the desired web genre of the returned pages. A predicted web genre sets expectations about the type of information a given web page contains. More specifically, web genres can be seen as textual categories such as scientific papers, home pages, or e-shops. In web search, specifying a genre alongside topical keywords lets a user easily find, for example, a scientific paper (genre) about text mining (topic). Web genre prediction is typically treated as a multi-class classification task, although some recent studies advocate introducing structure into the output space, either by allowing multiple web genres per web page or by exploiting a hierarchy of web genres. We investigate structuring the output space with hierarchies constructed by data-driven methods, by experts, or even at random. We also use 10 different representations of the web pages. To properly assess the influence of these information sources, we use predictive clustering trees and ensembles thereof. The experimental evaluation is performed on two benchmark corpora: 20-genre and SANTINIS-ML. The results reveal that exploiting a hierarchy of web genres yields the best predictive performance across both datasets, all predictive models, all feature sets, and all hierarchies. Moreover, data-driven hierarchy construction performs at least as well as expert-constructed hierarchies, with the added value that it is automatic and fast. Finally, ensembles offer state-of-the-art predictive performance and outperform single-tree models.

Keywords

Web genre classification, Hierarchy construction, Hierarchical multi-label classification
