Improved Estimator Selection for Off-Policy Evaluation

semanticscholar(2021)

Abstract
Off-policy policy evaluation is a fundamental problem in reinforcement learning. As a result, many estimators with different tradeoffs have been developed; however, selecting the best estimator is challenging with limited data and without additional interactive data collection. Recently, Su et al. (2020b) developed a data-dependent selection procedure that competes with the oracle selection up to a constant and demonstrated its practicality. We refine the analysis to remove an extraneous assumption and improve the procedure. The improved procedure yields a tighter oracle bound and stronger empirical results on a contextual bandit task.