Article Segmentation in Digitised Newspapers with a 2D Markov Model

ICDAR (2019)

Abstract
Document analysis and recognition is increasingly used to digitise collections of historical books, newspapers and other periodicals. In the digital humanities, the goal is often to apply information retrieval (IR) and natural language processing (NLP) techniques to help researchers analyse and navigate these digitised archives. The lack of article segmentation impairs many IR and NLP systems, which assume text is split into ordered, error-free documents. We define a document analysis and image processing task for segmenting digitised newspapers into articles and other content, e.g. adverts, and we automatically create a dataset of 11,602 articles. Using this dataset, we develop and evaluate an innovative 2D Markov model that encodes reading order and substantially outperforms the current state-of-the-art, reaching similar accuracy to human annotators.
Keywords
article segmentation, historical newspapers, document analysis, layout analysis