Beyond OCR + VQA: Involving OCR into the Flow for Robust and Accurate TextVQA

Gangyan Zeng, Yuan Zhang, Yu Zhou, Xiaomeng Yang

International Multimedia Conference (2021)

Abstract

Text-based visual question answering (TextVQA) requires analyzing both the visual contents and the texts in an image to answer a question, which is more practical than general visual question answering (VQA). Existing efforts tend to treat optical character recognition (OCR) as a pre-processing step and then combine it with a VQA framework. This makes the performance of multimodal reasoning and question answering highly dependent on the accuracy of OCR. In this work, we address this issue from two perspectives. First, we take advantage of multimodal cues to complete the semantic information of texts. A visually enhanced text embedding is proposed to enable understanding of texts without accurately recognizing them. Second, we further leverage rich contextual information to correct the answer texts even if the OCR module does not recognize them correctly. In addition, the visual objects are endowed with semantic representations so that objects lie in the same semantic space as OCR tokens. Equipped with these techniques, the cumulative error propagation caused by poor OCR performance is effectively suppressed. Extensive experiments on the TextVQA and ST-VQA datasets demonstrate that our approach achieves state-of-the-art performance in terms of accuracy and robustness.

Keywords

OCR, VQA, beyond
