Residual Q-Learning: Offline and Online Policy Customization without Value
NeurIPS 2023
Abstract
Imitation Learning (IL) is a widely used framework for learning imitative
behavior from demonstrations. It is especially appealing for solving complex
real-world tasks where handcrafting a reward function is difficult, or when the
goal is to mimic human expert behavior. However, the learned imitative policy
can only follow the behavior in the demonstration. When applying the imitative
policy, we may need to customize the policy behavior to meet different
requirements arising from diverse downstream tasks. Meanwhile, we still want the
customized policy to maintain its imitative nature. To this end, we formulate a
new problem setting called policy customization. It defines the learning task
as training a policy that inherits the characteristics of the prior policy
while satisfying some additional requirements imposed by a target downstream
task. We propose a novel and principled approach to interpret and determine the
trade-off between the two task objectives. Specifically, we formulate the
customization problem as a Markov Decision Process (MDP) with a reward function
that combines 1) the inherent reward of the demonstration; and 2) the add-on
reward specified by the downstream task. We propose a novel framework, Residual
Q-learning, which can solve the formulated MDP by leveraging the prior policy
without knowing its inherent reward or value function. We
derive a family of residual Q-learning algorithms that can realize offline and
online policy customization, and show that the proposed algorithms can
effectively accomplish policy customization tasks in various environments. Demo
videos and code are available on our website:
https://sites.google.com/view/residualq-learning.
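
To make the formulation concrete, here is an illustrative sketch of the combined objective and the resulting backup (the notation, including the trade-off weight ω and temperature α, is ours, not verbatim from the paper). The working assumption is that the prior policy π is max-entropy optimal for some unknown inherent reward r, so that α log π(a|s) equals the prior soft Q-value up to a state-dependent offset; the offset cancels in the soft Bellman backup, which is why neither r nor the prior value function is needed:

```latex
% Combined reward of the customization MDP, with \omega trading off the
% inherent objective against the add-on objective (placement illustrative):
%   r_{\mathrm{total}}(s,a) = \omega\, r(s,a) + r_{\mathrm{add}}(s,a)
% Residual Q-function relative to the prior soft Q-function Q_\pi:
%   \hat{Q}(s,a) := Q_{\mathrm{total}}(s,a) - \omega\, Q_\pi(s,a)
% Since \alpha \log \pi(a \mid s) = Q_\pi(s,a) - V_\pi(s), the prior-value
% terms cancel, leaving a backup that uses only r_add and the prior log-probs:
\hat{Q}(s,a) \leftarrow r_{\mathrm{add}}(s,a)
  + \gamma\, \alpha \log \sum_{a'}
    \exp\!\Big( \frac{\hat{Q}(s',a') + \omega\, \alpha \log \pi(a' \mid s')}{\alpha} \Big)
```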
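Below is a minimal tabular sketch of that update in Python, purely as a reading aid. All names and hyperparameters (Q_res, log_pi_prior, omega, alpha, and so on) are hypothetical, and a single shared temperature is assumed; the paper's actual algorithms cover function approximation and both the offline and online settings:

```python
import numpy as np

def residual_soft_q_update(Q_res, log_pi_prior, s, a, r_add, s_next, done,
                           omega=1.0, alpha=1.0, gamma=0.99, lr=0.1):
    """One TD update on the residual Q-table Q_res (shape: [n_states, n_actions]).

    log_pi_prior holds the prior policy's log-probabilities; the prior task's
    reward and value function are never referenced, only the add-on reward.
    """
    if done:
        v_gap = 0.0
    else:
        # Soft-value gap between the customized and (scaled) prior policies:
        # alpha * log sum_a' exp((Q_res + omega * alpha * log pi) / alpha)
        logits = (Q_res[s_next] + omega * alpha * log_pi_prior[s_next]) / alpha
        v_gap = alpha * np.log(np.sum(np.exp(logits)))
    target = r_add + gamma * v_gap
    Q_res[s, a] += lr * (target - Q_res[s, a])

def sample_customized_action(Q_res, log_pi_prior, s, omega=1.0, alpha=1.0,
                             rng=np.random.default_rng()):
    """Customized policy: pi_total(a|s) ∝ exp((Q_res + omega*alpha*log pi)/alpha)."""
    logits = (Q_res[s] + omega * alpha * log_pi_prior[s]) / alpha
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)
```

As a sanity check on this construction, with r_add ≡ 0 and omega = 1, Q_res = 0 is a fixed point and the customized policy reduces exactly to the prior policy; a nonzero add-on reward then tilts the policy toward the downstream requirement while the prior's log-probabilities still enter every decision.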
Keywords
online policy customization, offline, Q-learning