Reducing Vision-Answer Biases for Multiple-Choice VQA

IEEE Transactions on Image Processing (2023)

Abstract
Multiple-choice visual question answering (VQA) is a challenging task because it requires thorough multimodal understanding and complicated inter-modality relationship reasoning. To address this challenge, previous approaches usually resort to different multimodal interaction modules. Despite their effectiveness, we find that existing methods may exploit a newly discovered bias (the vision-answer bias) to make answer predictions, leading to suboptimal VQA performance and poor generalization. To solve these issues, we propose a Causality-based Multimodal Interaction Enhancement (CMIE) method, which is model-agnostic and can be seamlessly incorporated into a wide range of VQA approaches in a plug-and-play manner. Specifically, CMIE contains two key components: a causal intervention module and a counterfactual interaction learning module. The former removes the spurious correlation between the visual content and the answer caused by the vision-answer bias, while the latter helps capture discriminative inter-modality relationships by directly supervising multimodal interaction training via an interactive loss. Extensive experiments on three public benchmarks and one reorganized dataset show that the proposed method significantly improves seven representative VQA models, demonstrating the effectiveness and generalizability of CMIE.
Keywords
Multiple-choice VQA,vision-answer bias,causal intervention,counterfactual interaction learning
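The abstract does not give implementation details, but counterfactual supervision of the kind it alludes to can be illustrated with a minimal sketch. Everything below is an assumption for illustration only, not the authors' actual formulation: the loss combines a standard cross-entropy on the factual (matched image-question) prediction with a margin term that pushes the factual probability of the correct answer above the probability produced from a counterfactual input (e.g., a mismatched image), so the model cannot score well from the vision-answer shortcut alone.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over answer candidates.
    e = np.exp(logits - logits.max())
    return e / e.sum()

def counterfactual_interaction_loss(factual_logits, counterfactual_logits,
                                    answer_idx, margin=0.2):
    """Illustrative loss (NOT the paper's interactive loss): cross-entropy
    on the factual prediction plus a hinge term that requires the correct
    answer's factual probability to exceed its counterfactual probability
    by at least `margin` (all hyperparameters are assumptions)."""
    p_fact = softmax(factual_logits)
    p_cf = softmax(counterfactual_logits)
    ce = -np.log(p_fact[answer_idx] + 1e-12)
    gap = max(0.0, margin - (p_fact[answer_idx] - p_cf[answer_idx]))
    return ce + gap

# Factual logits favour answer 2; counterfactual (mismatched-image) logits are flat.
loss = counterfactual_interaction_loss(
    np.array([0.1, 0.2, 2.0, 0.1]),
    np.array([0.5, 0.5, 0.5, 0.5]),
    answer_idx=2)
```

With a confident factual prediction (probability ≈ 0.68 for the correct answer) and a flat counterfactual distribution, the hinge term is inactive and the loss reduces to the cross-entropy alone; if the counterfactual branch also favoured the correct answer, the margin penalty would grow, discouraging reliance on the visual shortcut.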