Kernel Metric Learning for In-Sample Off-Policy Evaluation of Deterministic RL Policies

ICLR 2024 (2024)

Abstract
We consider off-policy evaluation (OPE) of deterministic target policies for reinforcement learning (RL) in environments with continuous action spaces. While it is common to use importance sampling for OPE, it suffers from high variance when the behavior policy deviates significantly from the target policy. In order to address this issue, some recent works on OPE proposed in-sample learning with importance resampling. Yet, these approaches are not applicable to deterministic target policies for continuous action spaces. To address this limitation, we propose to relax the deterministic target policy using a kernel and learn the kernel metrics that minimize the overall mean squared error of the estimated temporal difference update vector of an action value function, where the action value function is used for policy evaluation. We derive the bias and variance of the estimation error due to this relaxation and provide analytic solutions for the optimal kernel metric. In empirical studies using various test domains, we show that OPE with in-sample learning using the kernel with the optimized metric achieves significantly improved accuracy compared to other baselines.
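To make the core idea concrete, the following is a minimal NumPy sketch of the kernel-relaxation step: logged actions from a behavior policy are reweighted by a Gaussian kernel centered at the deterministic target action pi(s), and these weights drive in-sample TD updates of a value estimate. The function names, the diagonal metric, and the simple kernel-weighted TD(0) evaluation are illustrative assumptions; the paper derives the optimal kernel metric analytically from the bias and variance of the relaxed TD update vector, which this sketch does not reproduce.

```python
# Sketch of kernel-relaxed in-sample OPE for a deterministic target policy.
# The Gaussian kernel, diagonal metric, and tabular-state TD evaluation are
# illustrative assumptions, not the paper's exact estimator.
import numpy as np

def gaussian_kernel_weights(actions, target_actions, metric_diag):
    """Weight each logged action by its kernel similarity to the
    deterministic target action pi(s), under a diagonal metric."""
    diff = actions - target_actions                      # (N, action_dim)
    sq_dist = np.sum((diff ** 2) / metric_diag, axis=1)  # Mahalanobis-style distance
    return np.exp(-0.5 * sq_dist)

def kernel_weighted_td_evaluation(batch, target_policy, n_states,
                                  metric_diag, gamma=0.99, lr=0.1, epochs=50):
    """Tabular-state example: update the value estimate only with in-sample
    transitions, each weighted by the kernel relaxation of pi."""
    states, actions, rewards, next_states = batch
    q = np.zeros(n_states)                               # value over discretized states
    target_actions = np.array([target_policy(s) for s in states])
    w = gaussian_kernel_weights(actions, target_actions, metric_diag)
    w = w / (w.sum() + 1e-8)                             # normalized resampling weights
    for _ in range(epochs):
        for i, (s, r, s_next) in enumerate(zip(states, rewards, next_states)):
            td_target = r + gamma * q[s_next]
            q[s] += lr * w[i] * (td_target - q[s])       # kernel-weighted TD update
    return q

# Toy usage: 1-D continuous actions, 5 discretized states.
rng = np.random.default_rng(0)
N = 200
states = rng.integers(0, 5, size=N)
actions = rng.normal(0.0, 1.0, size=(N, 1))              # behavior policy actions
rewards = -np.abs(actions[:, 0] - 0.5)                   # reward peaks at a = 0.5
next_states = rng.integers(0, 5, size=N)
pi = lambda s: np.array([0.5])                           # deterministic target policy
q_hat = kernel_weighted_td_evaluation(
    (states, actions, rewards, next_states), pi,
    n_states=5, metric_diag=np.array([0.25]))
print(q_hat)
```

A smaller metric (narrower kernel) reduces the bias introduced by the relaxation but concentrates the weights on fewer logged actions, increasing variance; choosing the metric to balance this trade-off is the quantity the paper optimizes in closed form.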
Keywords
off-policy evaluation, reinforcement learning, deterministic policy, continuous actions, metric learning