Efficient Policy Evaluation with Offline Data Informed Behavior Policy Design
arXiv (2023)
Abstract
Most reinforcement learning practitioners evaluate their policies with online
Monte Carlo estimators for either hyperparameter tuning or testing different
algorithmic design choices, where the policy is repeatedly executed in the
environment to get the average outcome. Such massive interactions with the
environment are prohibitive in many scenarios. In this paper, we propose novel
methods that improve the data efficiency of online Monte Carlo estimators while
maintaining their unbiasedness. We first propose a tailored closed-form
behavior policy that provably reduces the variance of an online Monte Carlo
estimator. We then design efficient algorithms to learn this closed-form
behavior policy from previously collected offline data. Theoretical analysis is
provided to characterize how the behavior policy learning error affects the
amount of reduced variance. Compared with previous works, our method achieves
better empirical performance in a broader set of environments, with fewer
requirements for offline data.
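To make the estimator family concrete: the abstract concerns Monte Carlo policy evaluation where rollouts are collected under a behavior policy and reweighted by likelihood ratios so the estimate stays unbiased for the target policy. The sketch below is a generic per-trajectory importance-sampling estimator for illustration only, not the paper's variance-reduced behavior policy; `sample_episode`, `pi`, and `b` are hypothetical names introduced here.

```python
import random

def sample_episode(policy):
    # Toy one-step MDP used purely for illustration: a single state "s0",
    # reward 1 for action 0 and reward 0 for action 1.
    a = 0 if random.random() < policy(0, "s0") else 1
    return [("s0", a, 1.0 if a == 0 else 0.0)]

def mc_is_estimate(pi, b, num_episodes=1000):
    """Plain importance-sampling Monte Carlo estimate of the target
    policy's value (NOT the paper's tailored behavior policy).

    pi(a, s) and b(a, s) return action probabilities under the target
    and behavior policies; episodes are sampled under b and reweighted
    by the product of likelihood ratios, which keeps the estimate
    unbiased: E_b[rho * G] = E_pi[G].
    """
    total = 0.0
    for _ in range(num_episodes):
        episode = sample_episode(b)
        ratio, ret = 1.0, 0.0
        for (s, a, r) in episode:
            ratio *= pi(a, s) / b(a, s)  # accumulate likelihood ratio
            ret += r                     # undiscounted return
        total += ratio * ret
    return total / num_episodes
```

The paper's contribution, per the abstract, is choosing `b` in closed form so that the variance of this kind of reweighted estimate provably shrinks; choosing `b` poorly can instead inflate the ratios and the variance.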