ChemScraper: Leveraging PDF Graphics Instructions for Molecular Diagram Parsing

Ayush Kumar Shah, Bryan Manrique Amador, Abhisek Dey, Ming Creekmore, Blake Ocampo, Scott Denmark, Richard Zanibbi

arXiv (2023)

Abstract
Most molecular diagram parsers recover chemical structure from raster images (e.g., PNGs). However, many PDFs include commands giving explicit locations and shapes for characters, lines, and polygons. We present a new parser that uses these born-digital PDF primitives as input. The parsing model is fast and accurate, and does not require GPUs, Optical Character Recognition (OCR), or vectorization. We use the parser to annotate raster images and then train a new multi-task neural network for recognizing molecules in raster images. We evaluate our parsers using SMILES and standard benchmarks, along with a novel evaluation protocol comparing molecular graphs directly that supports automatic error compilation and reveals errors missed by SMILES-based evaluation.