Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning

2017 IEEE International Conference on Computer Vision (ICCV)

Cited 454 | Viewed 535
Abstract
We introduce the first goal-driven training for visual question answering and dialog agents. Specifically, we pose a cooperative 'image guessing' game between two agents -- Qbot and Abot -- who communicate in natural language dialog so that Qbot can select an unseen image from a lineup of images. We use deep reinforcement learning (RL) to learn the policies of these agents end-to-end -- from pixels to multi-agent multi-round dialog to game reward. We demonstrate two experimental results. First, as a 'sanity check' demonstration of pure RL (from scratch), we show results on a synthetic world, where the agents communicate in ungrounded vocabulary, i.e., symbols with no pre-specified meanings (X, Y, Z). We find that two bots invent their own communication protocol and start using certain symbols to ask/answer about certain visual attributes (shape/color/style). Thus, we demonstrate the emergence of grounded language and communication among 'visual' dialog agents with no human supervision. Second, we conduct large-scale real-image experiments on the VisDial dataset, where we pretrain with supervised dialog data and show that the RL 'fine-tuned' agents significantly outperform SL agents. Interestingly, the RL Qbot learns to ask questions that Abot is good at, ultimately resulting in more informative dialog and a better team.
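The synthetic-world experiment described above — two agents inventing a protocol over ungrounded symbols, trained purely from the shared game reward — can be sketched in miniature. The toy below is an illustrative assumption, not the paper's setup: it uses tabular policies and a plain REINFORCE update with a running-mean baseline, whereas the paper trains recurrent neural policies. Targets have two binary attributes (shape, color); Qbot emits one of two meaningless symbols per round, Abot (who sees the target) replies with one bit, and Qbot guesses the target after two rounds. All table names and hyperparameters are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sample(logits):
    """Sample an action index from softmax(logits); return (action, probs)."""
    p = softmax(logits)
    return rng.choice(len(p), p=p), p

# Toy synthetic world (illustrative, not the paper's exact configuration):
# 2 binary attributes -> 4 possible targets; 2 question symbols; 2 rounds.
N_SYM, N_TARGETS, N_ROUNDS = 2, 4, 2
LR, EPISODES = 0.3, 30000

q_logits = np.zeros((N_ROUNDS, N_SYM))                # Qbot: which symbol to ask each round
a_logits = np.zeros((N_SYM, 2, 2, 2))                 # Abot: answer bit given (symbol, shape, color)
g_logits = np.zeros((N_SYM, 2, N_SYM, 2, N_TARGETS))  # Qbot: guess given the full dialog

baseline = 0.0  # running-mean reward baseline to reduce gradient variance

def episode(learn=True):
    global baseline
    shape, color = rng.integers(2), rng.integers(2)
    target = 2 * shape + color
    grads, syms, answers = [], [], []
    for t in range(N_ROUNDS):
        s, ps = sample(q_logits[t])                   # Qbot asks a symbol
        a, pa = sample(a_logits[s, shape, color])     # Abot answers with one bit
        grads += [(q_logits, (t,), ps, s), (a_logits, (s, shape, color), pa, a)]
        syms.append(s); answers.append(a)
    idx = (syms[0], answers[0], syms[1], answers[1])
    guess, pg = sample(g_logits[idx])                 # Qbot guesses the target
    grads.append((g_logits, idx, pg, guess))
    reward = 1.0 if guess == target else 0.0          # shared team reward
    if learn:
        adv = reward - baseline
        baseline += 0.01 * (reward - baseline)
        for table, where, p, act in grads:            # REINFORCE: adv * grad log pi
            g = -p
            g[act] += 1.0                             # gradient of log-softmax
            table[where] += LR * adv * g
    return reward

for _ in range(EPISODES):
    episode()

success = np.mean([episode(learn=False) for _ in range(2000)])
print(f"post-training success rate: {success:.2f}")   # chance level is 0.25
```

With the shared reward as the only signal, the gradient pushes Qbot toward asking both symbols and Abot toward answering each symbol with a different attribute — a miniature version of the emergent symbol grounding the abstract reports.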
Keywords
visual dialog agents, deep reinforcement learning, goal-driven training, visual question answering, cooperative image guessing game, A-BOT, natural language dialog, unseen image, multiagent multiround dialog, bots, communication protocol, visual attributes, grounded language, real-image experiments, dialog data, supervised learning, RL finetuned agents, informative dialog, supervised pretraining, RL Q-BOT