PEaCE: A Chemistry-Oriented Dataset for Optical Character Recognition on Scientific Documents
arXiv (2024)
Abstract
Optical Character Recognition (OCR) is an established task with the objective
of identifying the text present in an image. While many off-the-shelf OCR
models exist, they are often trained for either scientific (e.g., formulae) or
generic printed English text. Extracting text from chemistry publications
requires an OCR model that is capable in both realms. Nougat, a recent tool,
exhibits a strong ability to parse academic documents, but is unable to parse
tables in PubMed articles, which make up a significant part of the academic
literature and are the focus of this work. To address this gap, we present the
Printed English and Chemical Equations (PEaCE) dataset, containing both
synthetic and real-world records, and evaluate the efficacy of
transformer-based OCR models when trained on this resource. Given that
real-world records contain artifacts not present in synthetic records, we
propose transformations that mimic such qualities. We perform a suite of
experiments to explore the impact of patch size, multi-domain training, and our
proposed transformations, ultimately finding that models with a small patch
size trained on multiple domains using the proposed transformations yield the
best performance. Our dataset and code are available at
https://github.com/ZN1010/PEaCE.
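
To make the patch-size finding concrete, the sketch below shows how patch size changes the number of tokens a transformer encoder sees for one image. It assumes a ViT-style encoder that splits the input into non-overlapping square patches; the function name `patch_grid` and the 384x384 input size are illustrative assumptions, not details taken from the paper or its repository.

```python
def patch_grid(height: int, width: int, patch_size: int) -> tuple[int, int, int]:
    """Return (rows, cols, total) of non-overlapping square patches.

    Hypothetical helper for illustration only: assumes a ViT-style
    patchification in which the image is cut into patch_size x patch_size
    tiles and each tile becomes one input token.
    """
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    rows = height // patch_size
    cols = width // patch_size
    return rows, cols, rows * cols


if __name__ == "__main__":
    # Assumed input resolution; real models may resize images differently.
    for size in (16, 8, 4):
        rows, cols, total = patch_grid(384, 384, size)
        print(f"patch {size:>2}: {rows} x {cols} grid, {total} tokens")
```

Halving the patch size quadruples the token count, so each token covers a finer region of the image. That may help with small glyphs such as subscripts and superscripts in chemical equations, at the price of longer input sequences.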