Extracting General Lists from Web Documents: A Hybrid Approach.

Fabio Fumarola, Tim Weninger, Rick Barber, Donato Malerba, Jiawei Han

IEA/AIE'11 Proceedings of the 24th international conference on Industrial engineering and other applications of applied intelligent systems conference on Modern approaches in applied intelligence - Volume Part I (2011)

Abstract

The problem of extracting structured data (i.e., lists, record sets, tables, etc.) from the Web has traditionally been approached by taking into account either the underlying markup structure of a Web page or its visual structure. However, empirical results show that considering the HTML structure and the visual cues of a Web page independently does not generalize well. We propose a new hybrid method to extract general lists from the Web. It employs both general assumptions on the visual rendering of lists and the structural representation of the items contained in them. We show that our method significantly outperforms existing methods across a varied Web corpus.

Keywords

Web page, varied Web corpus, HTML structure, underlying markup structure, visual cue, visual rendering, visual structure, general assumption, general list, new hybrid method, hybrid approach, web document
