Exploiting Estimation Bias in Deep Double Q-Learning for Actor-Critic Methods
CoRR (2024)
Abstract
This paper introduces innovative methods in Reinforcement Learning (RL),
focusing on addressing and exploiting estimation biases in Actor-Critic methods
for continuous control tasks, using Deep Double Q-Learning. We propose two
novel algorithms: Expectile Delayed Deep Deterministic Policy Gradient (ExpD3)
and Bias Exploiting - Twin Delayed Deep Deterministic Policy Gradient (BE-TD3).
ExpD3 aims to reduce overestimation bias with a single Q estimate, offering a
balance between computational efficiency and performance, while BE-TD3 is
designed to dynamically select the most advantageous estimation bias during
training. Our extensive experiments across various continuous control tasks
demonstrate the effectiveness of our approaches. We show that these algorithms
can either match or surpass existing methods like TD3, particularly in
environments where estimation biases significantly impact learning. The results
underline the importance of bias exploitation in improving policy learning in
RL.
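To illustrate the core idea behind ExpD3's single-estimate bias control, here is a minimal sketch of an asymmetric expectile-style TD loss, assuming the standard expectile formulation; the function name and the choice of `tau` are illustrative, not taken from the paper:

```python
import numpy as np

def expectile_loss(td_errors, tau=0.4):
    """Asymmetric expectile loss on TD errors.

    Positive TD errors (target above the current Q estimate) are
    weighted by tau, negative ones by (1 - tau). With tau < 0.5,
    upward errors are penalized less than downward ones, nudging a
    single Q estimate to underestimate rather than overestimate.
    With tau = 0.5 this reduces to the ordinary squared error.
    """
    td_errors = np.asarray(td_errors, dtype=float)
    weight = np.where(td_errors > 0, tau, 1.0 - tau)
    return float(np.mean(weight * td_errors ** 2))
```

The asymmetry is the point: a twin-critic method such as TD3 counteracts overestimation by taking the minimum of two Q estimates, whereas an expectile loss achieves a comparable pessimistic tilt with one critic, which is the computational trade-off the abstract alludes to.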