The Virtues of Pessimism in Inverse Reinforcement Learning
CoRR (2024)
Abstract
Inverse Reinforcement Learning (IRL) is a powerful framework for learning
complex behaviors from expert demonstrations. However, it traditionally
requires repeatedly solving a computationally expensive reinforcement learning
(RL) problem in its inner loop. It is desirable to reduce the exploration
burden by leveraging expert demonstrations in the inner-loop RL. For example,
recent work resets the learner to expert states in order to expose it to
high-reward regions of the state space. However, such resets are infeasible in
the real world. In this work, we consider an alternative approach to speeding up
the RL subroutine in IRL: pessimism, i.e., staying close to the expert's
data distribution, instantiated via the use of offline RL algorithms. We
formalize a connection between offline RL and IRL, enabling us to use an
arbitrary offline RL algorithm to improve the sample efficiency of IRL. We
validate our theory experimentally by demonstrating a strong correlation
between the efficacy of an offline RL algorithm and how well it works as part
of an IRL procedure. By using a strong offline RL algorithm as part of an IRL
procedure, we are able to find policies that match expert performance
significantly more efficiently than the prior art.
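To make the recipe in the abstract concrete, below is a minimal, self-contained sketch of IRL with a pessimistic inner loop on a toy chain MDP. This is not the authors' implementation: the reward update is a standard feature-matching step, and the inner-loop "offline RL" is soft value iteration on the learned reward plus a behavior-cloning bonus toward the expert's empirical action distribution (weight alpha), standing in for "an arbitrary offline RL algorithm." All names, hyperparameters, and the environment are illustrative assumptions.

```python
# Hypothetical sketch of IRL with a pessimistic (offline-RL-style) inner loop.
# Toy chain MDP; the true reward (reach the rightmost state) is hidden and
# only expert demonstrations are observed.
import numpy as np

N, H = 8, 10            # chain length, episode horizon
GAMMA = 0.95            # discount factor
rng = np.random.default_rng(0)

def transition(s, a):   # actions: 0 = left, 1 = right (deterministic chain)
    return max(0, s - 1) if a == 0 else min(N - 1, s + 1)

def rollout(policy, n_episodes=20):
    """Normalized state-visitation frequencies under a stochastic policy."""
    visits = np.zeros(N)
    for _ in range(n_episodes):
        s = 0
        for _ in range(H):
            visits[s] += 1
            a = rng.choice(2, p=policy[s])
            s = transition(s, a)
    return visits / visits.sum()

def pessimistic_inner_rl(theta, expert_policy, alpha=2.0, tau=0.1):
    """Stand-in for the offline RL subroutine: soft value iteration on the
    learned reward theta, plus a log-prob bonus toward the expert's empirical
    policy -- the pessimism term that keeps the learner close to the expert's
    data distribution (weight alpha)."""
    Q = np.zeros((N, 2))
    for _ in range(100):
        m = Q.max(axis=1)  # numerically stable soft-max value
        V = m + tau * np.log(np.exp((Q - m[:, None]) / tau).sum(axis=1))
        Q = np.array([[theta[transition(s, a)] + GAMMA * V[transition(s, a)]
                       for a in range(2)] for s in range(N)])
    logits = Q / tau + alpha * np.log(expert_policy + 1e-8)
    pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    return pi / pi.sum(axis=1, keepdims=True)

# Expert (nearly) always moves right; in practice expert_policy would be
# estimated from demonstrations rather than given.
expert_policy = np.tile([0.02, 0.98], (N, 1))
expert_visits = rollout(expert_policy)

theta = np.zeros(N)  # learned per-state reward
for _ in range(30):
    policy = pessimistic_inner_rl(theta, expert_policy)
    learner_visits = rollout(policy)
    # Outer IRL step: raise reward where the expert goes, lower it where the
    # learner goes (feature-matching / adversarial reward update).
    theta += 0.5 * (expert_visits - learner_visits)

print("learned reward by state:", np.round(theta, 2))
print("P(right) by state:      ", np.round(policy[:, 1], 2))
```

Note the division of labor: the outer loop only adjusts the reward, while all exploration happens inside the pessimistic subroutine, which the behavior-cloning bonus anchors to the expert's data distribution from the very first iteration.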