Generate then Select: Open-ended Visual Question Answering Guided by World Knowledge

Xingyu Fu, Sheng Zhang, Gukyeong Kwon, Pramuditha Perera, Henghui Zhu, Yuhao Zhang, Alexander Hanbo Li, William Yang Wang, Zhiguo Wang, Vittorio Castelli, Patrick Ng, Dan Roth, Bing Xiang

ACL (2023)

Abstract

The open-ended Visual Question Answering (VQA) task requires AI models to jointly reason over visual and natural language inputs using world knowledge. Recently, pre-trained language models (PLMs) such as GPT-3 have been applied to the task and shown to be powerful sources of world knowledge. However, these methods suffer from two problems: low knowledge coverage caused by PLM bias (the tendency to generate certain tokens over others regardless of prompt changes), and high dependency on PLM quality (only models using GPT-3 achieve the best results). To address these challenges, we propose RASO, a new VQA pipeline that is the first to deploy a generate-then-select strategy guided by world knowledge. Rather than following the de facto standard of training a multi-modal model that directly generates the VQA answer, RASO first uses a PLM to generate all possible answers, and then trains a lightweight answer selection model to pick the correct one. As shown in our analysis, RASO expands the knowledge coverage from in-domain training data by a large margin. Extensive experiments show the effectiveness of our pipeline, which advances the state of the art by 4.1% on OK-VQA without additional computation cost. Code and models are released at http://cogcomp.org/page/publication_view/1010
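
The generate-then-select idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' RASO implementation: the names `generate_candidates`, `select_answer`, `fake_plm`, and `toy_score` are hypothetical, the PLM is replaced by a canned lookup, and the trained selection model is replaced by a fixed score table. It only shows the control flow: pool candidate answers from several prompts, then let a separate scorer choose among them.

```python
from typing import Callable, Dict, List, Sequence


def generate_candidates(
    generate_fn: Callable[[str], List[str]],
    prompts: Sequence[str],
) -> List[str]:
    """Pool answers proposed for several prompts.

    Answers are lowercased and stripped, and duplicates are dropped
    while keeping first-seen order.
    """
    seen = set()
    pooled: List[str] = []
    for prompt in prompts:
        for answer in generate_fn(prompt):
            key = answer.strip().lower()
            if key and key not in seen:
                seen.add(key)
                pooled.append(key)
    return pooled


def select_answer(
    candidates: Sequence[str],
    score_fn: Callable[[str], float],
) -> str:
    """Return the candidate with the highest score."""
    if not candidates:
        raise ValueError("no candidate answers to select from")
    return max(candidates, key=score_fn)


# Hypothetical stand-in for a PLM: canned answers per prompt.
prompts = [
    "Q: What sport is this? Context: a man riding a wave",
    "Q: What sport is this? Context: a board in the ocean",
]
canned: Dict[str, List[str]] = {
    prompts[0]: ["Surfing", "swimming"],
    prompts[1]: ["surfing", "Paddle boarding"],
}


def fake_plm(prompt: str) -> List[str]:
    return canned.get(prompt, [])


# Hypothetical stand-in for a trained selection model: a fixed score table.
toy_scores = {"surfing": 0.9, "swimming": 0.3, "paddle boarding": 0.5}


def toy_score(answer: str) -> float:
    return toy_scores.get(answer, 0.0)


candidates = generate_candidates(fake_plm, prompts)
answer = select_answer(candidates, toy_score)
print(candidates, answer)
```

Separating the two steps means the candidate pool can be widened (more prompts, more samples) without retraining the generator, while the selector stays small; in a real system the scorer would also condition on the image and question rather than on the answer string alone.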

Keywords

visual question answering, knowledge, select, world, open-ended
