Off-Policy Primal-Dual Safe Reinforcement Learning
CoRR (2024)
Abstract
Primal-dual safe RL methods commonly alternate between the primal
update of the policy and the dual update of the Lagrange multiplier. Such a
training paradigm is highly susceptible to error in cumulative cost
estimation, since this estimate serves as the key bond connecting the primal
and dual update processes. We show that this problem causes significant
underestimation of cost when using off-policy methods, leading to the failure
to satisfy the safety constraint. To address this issue, we propose
conservative policy optimization, which learns a policy in a
constraint-satisfying area by considering the uncertainty in cost estimation.
This improves constraint satisfaction but also potentially hinders reward
maximization. We then introduce local policy convexification to help
eliminate such suboptimality by gradually reducing the estimation uncertainty.
We provide theoretical interpretations of the coupled effects of these
two ingredients and verify them through extensive experiments. Results on
benchmark tasks show that our method not only achieves an asymptotic
performance comparable to state-of-the-art on-policy methods while using much
fewer samples, but also significantly reduces constraint violation during
training. Our code is available at https://github.com/ZifanWu/CAL.
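The two key ideas in the abstract can be sketched concretely: the dual update raises the Lagrange multiplier when the estimated cost exceeds the limit, and conservative policy optimization replaces the raw cost estimate with an upper-confidence one that accounts for estimation uncertainty. Below is a minimal illustrative sketch of this primal-dual machinery; the ensemble-based uncertainty bonus, the coefficient `k`, and the learning rate are assumptions for illustration, not the paper's exact CAL algorithm.

```python
import numpy as np

def conservative_cost(cost_ensemble_preds, k=1.0):
    """Upper-confidence cost estimate: ensemble mean plus k times ensemble std.

    Using the std of an ensemble of cost critics as an uncertainty bonus is an
    illustrative assumption, not necessarily the paper's exact estimator.
    """
    preds = np.asarray(cost_ensemble_preds, dtype=float)
    return preds.mean() + k * preds.std()

def dual_update(lmbda, cost_estimate, cost_limit, lr=0.05):
    """One dual step: gradient ascent on the multiplier, projected to lmbda >= 0."""
    return max(0.0, lmbda + lr * (cost_estimate - cost_limit))

# Toy usage: three hypothetical cost-critic predictions for the current policy.
lam = 0.0
est = conservative_cost([4.8, 5.4, 5.1], k=1.0)  # exceeds the raw mean of 5.1
lam = dual_update(lam, est, cost_limit=5.0)      # multiplier rises, penalizing cost
```

Because the conservative estimate upper-bounds the mean, the multiplier grows even when the mean cost sits at the limit, which is one way to read the abstract's claim that conservatism improves constraint satisfaction at the possible expense of reward.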
Keywords
Safe Reinforcement Learning