A Comparative Study of Margin Noise Removal Algorithms on MarNR: A Margin Noise Dataset of Document Images

2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), 2017

Abstract
Margin noise removal is an important step prior to segmentation and optical character recognition (OCR) of a page. The presence of this noise causes segmentation algorithms and OCR systems to produce erroneous output. In this paper, we present MarNR, a margin noise removal dataset, together with a comparative study of four margin noise removal algorithms. We evaluate the algorithms with seven metrics. Three of them, Hamming distance, noise ratio, and page content removal, assess an algorithm either by the quantity of noise removed or by how much of the original image content is recovered. We also treat margin noise removal as a binary classification task and define four further metrics from confusion matrices obtained experimentally over a labeled test dataset generated explicitly for evaluating margin noise removal algorithms. The dataset consists of document images that vary in layout and in margin noise. The labeled dataset is made public for comparative study of different margin noise removal algorithms.
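The abstract names the metrics but gives no formulas, so the following is only a minimal sketch of how pixel-level evaluation of this kind might look. It assumes binary images stored as nested lists of 0/1, a ground-truth noise mask, and that the four confusion-matrix metrics are accuracy, precision, recall, and F1; the abstract does not name them, and all function names here are hypothetical.

```python
def hamming_distance(image_a, image_b):
    """Count the pixels at which two equally sized binary images differ."""
    if len(image_a) != len(image_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(image_a, image_b)
    ):
        raise ValueError("images must have the same shape")
    return sum(
        a != b
        for row_a, row_b in zip(image_a, image_b)
        for a, b in zip(row_a, row_b)
    )


def confusion_counts(predicted_noise, true_noise):
    """Return (tp, fp, fn, tn), treating 'noise' as the positive class."""
    tp = fp = fn = tn = 0
    for row_p, row_t in zip(predicted_noise, true_noise):
        for p, t in zip(row_p, row_t):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def classification_scores(tp, fp, fn, tn):
    """Assumed metric set: accuracy, precision, recall, F1 (not from the paper)."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Under these assumptions, an algorithm's predicted noise mask would be compared against the labeled mask with `confusion_counts`, and the resulting counts passed to `classification_scores`; the paper's own definitions may differ.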
Keywords
Margin noise dataset, Performance evaluation, Margin noise removal