Imitation Learning to Outperform Demonstrators by Directly Extrapolating Demonstrations

Conference on Information and Knowledge Management (2022)

Abstract
We consider the problem of imitation learning from suboptimal demonstrations, which aims to learn a better policy than the demonstrators'. Previous methods usually learn a reward function to encode the underlying intention of the demonstrators and use standard reinforcement learning to learn a policy based on this reward function. Such methods can fail to control the distribution shift between the demonstrations and the learned policy, since the learned reward function may not generalize well to out-of-distribution samples and can mislead the agent into highly uncertain states, resulting in degraded performance. To address this limitation, we propose a novel algorithm called Outperforming demonstrators by Directly Extrapolating Demonstrations (ODED). Instead of learning a reward function, ODED trains an ensemble of extrapolation networks that, from the provided demonstrations, generate extrapolated demonstrations, i.e., demonstrations that might be induced by a good agent. With these extrapolated demonstrations, an off-the-shelf imitation learning algorithm can then learn a good policy. Guided by the extrapolated demonstrations, the learned policy avoids visiting highly uncertain states and thereby controls the distribution shift. Empirically, we show that ODED outperforms suboptimal demonstrators and achieves better performance than state-of-the-art imitation learning algorithms on MuJoCo and DeepMind Control Suite tasks.
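The abstract describes the ODED pipeline only at a high level, so the following is a minimal sketch of the ensemble-of-extrapolation-networks idea, assuming a simple per-transition MLP and mean aggregation over the ensemble. The class and function names, network architecture, dimensions, and aggregation rule are all illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class ExtrapolationNetwork(nn.Module):
    """One member of the ensemble: maps a (state, action) pair from a
    suboptimal demonstration to an extrapolated pair that a stronger
    policy might produce. The MLP architecture is an assumption; the
    abstract does not specify one."""

    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 256):
        super().__init__()
        in_dim = state_dim + action_dim
        self.state_dim = state_dim
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, in_dim),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor):
        out = self.net(torch.cat([state, action], dim=-1))
        return out[..., : self.state_dim], out[..., self.state_dim :]

def extrapolate_demonstrations(ensemble, demos):
    """Produce extrapolated demonstrations by averaging the ensemble's
    predictions per transition. Mean aggregation is one plausible rule,
    not necessarily the paper's."""
    extrapolated = []
    with torch.no_grad():
        for state, action in demos:
            preds = [net(state, action) for net in ensemble]
            new_state = torch.stack([s for s, _ in preds]).mean(dim=0)
            new_action = torch.stack([a for _, a in preds]).mean(dim=0)
            extrapolated.append((new_state, new_action))
    return extrapolated

# Hypothetical usage with MuJoCo-like dimensions (e.g., HalfCheetah:
# 17-dim states, 6-dim actions); the ensemble size of 5 is also assumed.
ensemble = [ExtrapolationNetwork(state_dim=17, action_dim=6) for _ in range(5)]
demos = [(torch.randn(17), torch.randn(6)) for _ in range(10)]
extrapolated_demos = extrapolate_demonstrations(ensemble, demos)
# extrapolated_demos can now be fed to an off-the-shelf imitation
# learning algorithm (e.g., behavioral cloning) to train the policy.
```

Under these assumptions, the extrapolated (state, action) pairs stand in for demonstrations from a better agent, and any standard imitation learning algorithm consumes them unchanged, which is what lets ODED avoid reward learning entirely.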
Keywords
extrapolating demonstrations, imitation learning, outperform demonstrators