ChemScraper: Leveraging PDF Graphics Instructions for Molecular Diagram Parsing
arXiv (2023)
Abstract
Most molecular diagram parsers recover chemical structure from raster images
(e.g., PNGs). However, many PDFs include commands giving explicit locations and
shapes for characters, lines, and polygons. We present a new parser that uses
these born-digital PDF primitives as input. The parsing model is fast and
accurate, and does not require GPUs, Optical Character Recognition (OCR), or
vectorization. We use the parser to annotate raster images and then train a new
multi-task neural network for recognizing molecules in raster images. We
evaluate our parsers using SMILES and standard benchmarks, along with a novel
evaluation protocol that compares molecular graphs directly, supports automatic
error compilation, and reveals errors missed by SMILES-based evaluation.
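
To make the idea of comparing molecular graphs directly more concrete, here is a minimal Python sketch. It is not the paper's evaluation protocol. The `MolGraph` class, the `compare_graphs` function, and the assumption that ground-truth and predicted atoms already share identifiers (for example, after matching by position) are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MolGraph:
    """Hypothetical labeled graph: atoms keyed by id, bonds keyed by atom pair."""
    atoms: dict[int, str]
    bonds: dict[frozenset, int] = field(default_factory=dict)

    def add_bond(self, a: int, b: int, order: int = 1) -> None:
        # Bonds are undirected, so the key is an unordered pair of atom ids.
        self.bonds[frozenset((a, b))] = order


def compare_graphs(truth: MolGraph, pred: MolGraph) -> list[tuple]:
    """List every atom and bond on which the two graphs disagree.

    Assumes atom ids are already aligned between the two graphs.
    Each error is (kind, key, expected, predicted); None means "absent".
    """
    errors = []
    for atom_id in sorted(truth.atoms.keys() | pred.atoms.keys()):
        expected, got = truth.atoms.get(atom_id), pred.atoms.get(atom_id)
        if expected != got:
            errors.append(("atom", atom_id, expected, got))
    for bond in sorted(truth.bonds.keys() | pred.bonds.keys(), key=sorted):
        expected, got = truth.bonds.get(bond), pred.bonds.get(bond)
        if expected != got:
            errors.append(("bond", tuple(sorted(bond)), expected, got))
    return errors


# Ground truth: ethanol skeleton C-C-O (hydrogens omitted).
truth = MolGraph(atoms={0: "C", 1: "C", 2: "O"})
truth.add_bond(0, 1)
truth.add_bond(1, 2)

# Illustrative prediction with one wrong atom label, one wrong bond order,
# and one missing bond.
pred = MolGraph(atoms={0: "C", 1: "C", 2: "N"})
pred.add_bond(0, 1, order=2)

errors = compare_graphs(truth, pred)
for err in errors:
    print(err)
```

Because each mismatch is reported per atom or bond rather than as a single string comparison, the errors can be tallied automatically by type, which is the kind of error compilation the abstract describes.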