Investigating Objectives for Off-policy Value Estimation in Reinforcement Learning

Andrew Patterson, Sina Ghiassian, Adam White

Semantic Scholar (2020)

Abstract
This paper investigates the problem of online prediction learning, where prediction, action, and learning proceed continuously as the agent interacts with an unknown environment. The predictions made by the agent are contingent on a particular way of behaving—specifying what would happen if the agent behaved in a particular way—represented as a value function. However, the behavior used to select actions and generate the data might differ from the behavior used to define the predictions, and thus the samples are generated off-policy. The ability to learn behavior-contingent predictions online and off-policy has long been advocated as a key capability of predictive-knowledge learning systems, but has remained an open algorithmic challenge for decades. The fundamental issue lies with the temporal difference learning update at the heart of most value-function learning algorithms: combining bootstrapping, off-policy sampling, and fixed-basis function approximation may cause the value estimate to diverge to infinity (e.g., Q-learning with linear function approximation). A major breakthrough came with the development of a new objective function, called the projected Bellman error, that admitted lightweight stochastic gradient descent variants of temporal difference learning. Since then, many sound online off-policy prediction algorithms have been developed, but largely for the linear setting. This development has also brought several modifications of the objective itself, exposing a fundamental open question in off-policy value estimation: what objective should we use? In this work, we first summarize the large body of literature on off-policy learning, (1) highlighting the similarities in the underlying objectives of the algorithms and (2) extracting the key strategies behind many of the algorithms, which can then be used across objectives. We then describe a generalized projected Bellman error that naturally extends to the nonlinear value estimation setting. We show how this generalized objective unifies previous work, including previous theory. We use this simplified view to derive easy-to-use, but sound, algorithms that we show perform well in both prediction and control.
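
The contrast described in the abstract can be sketched in a few lines: plain off-policy linear TD(0), whose semi-gradient update can diverge when bootstrapping, off-policy sampling, and linear features are combined, versus a TDC/gradient-TD style update that performs stochastic gradient descent toward a projected-Bellman-error objective. This is a minimal illustrative sketch, not the paper's implementation; the feature vectors, step sizes, and importance ratios below are placeholder assumptions.

import numpy as np

def off_policy_td0(w, x, r, x_next, rho, gamma=0.99, alpha=0.01):
    # Plain off-policy linear TD(0) (semi-gradient): may diverge under
    # off-policy sampling with fixed-basis function approximation.
    delta = r + gamma * np.dot(x_next, w) - np.dot(x, w)  # TD error
    return w + alpha * rho * delta * x

def tdc_update(w, h, x, r, x_next, rho, gamma=0.99, alpha=0.01, beta=0.05):
    # TDC-style gradient correction: a lightweight stochastic-gradient
    # variant of TD aimed at a projected-Bellman-error objective.
    # h is a secondary weight vector estimating the expected TD error
    # in the feature space.
    delta = r + gamma * np.dot(x_next, w) - np.dot(x, w)
    w_new = w + alpha * rho * (delta * x - gamma * np.dot(x, h) * x_next)
    h_new = h + beta * rho * (delta - np.dot(x, h)) * x
    return w_new, h_new

# Tiny usage sketch with random features; rho is the importance ratio
# (target-policy probability over behavior-policy probability).
rng = np.random.default_rng(0)
d = 8
w, h = np.zeros(d), np.zeros(d)
for _ in range(1000):
    x, x_next = rng.normal(size=d), rng.normal(size=d)
    r, rho = rng.normal(), 1.0
    w, h = tdc_update(w, h, x, r, x_next, rho)

The two-timescale structure (a fast secondary estimator h alongside the primary weights w) is the common pattern behind many of the sound off-policy algorithms the paper surveys.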