Reward-Adaptive Reinforcement Learning: Dynamic Policy Gradient Optimization for Bipedal Locomotion

IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)

Abstract
Controlling a non-statically stable bipedal robot is challenging due to the complex dynamics and multi-criterion optimization involved. Recent works have demonstrated the effectiveness of deep reinforcement learning (DRL) for simulated and physical robots. In these methods, the rewards from different criteria are normally summed to learn a scalar function. However, a scalar is less informative and may be insufficient to derive effective information for each reward channel from the complex hybrid rewards. In this work, we propose a novel reward-adaptive reinforcement learning method for bipedal locomotion, allowing the control policy to be simultaneously optimized by multiple criteria using a dynamic mechanism. The proposed method applies a multi-head critic to learn a separate value function for each reward component, leading to hybrid policy gradients. We further propose a dynamic weighting mechanism, allowing each component to optimize the policy with a different priority. This hybrid and dynamic policy gradient (HDPG) design makes the agent learn more efficiently. We show that the proposed method outperforms summed-reward approaches and is able to transfer to physical robots. The MuJoCo results further demonstrate the effectiveness and generalization of HDPG.
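The core idea in the abstract, combining per-component policy gradients under dynamic weights, can be sketched numerically. This is a minimal illustration, not the paper's implementation: the softmax-over-advantage-magnitude weighting rule, the component names, and all array shapes are assumptions introduced here for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 3 reward components (e.g. forward velocity,
# balance, energy cost), a batch of 5 timesteps, 4 policy parameters.
n_components, batch, n_params = 3, 5, 4

# A multi-head critic gives one value head per reward channel, so each
# component i yields its own advantage estimate A_i(s, a) per timestep.
advantages = rng.normal(size=(n_components, batch))

# Score-function gradients grad log pi(a|s) per timestep (illustrative).
score_grads = rng.normal(size=(batch, n_params))

# Dynamic weights: here, a softmax over each component's mean absolute
# advantage, so channels carrying a larger learning signal get higher
# priority. The paper's actual weighting rule may differ.
signal = np.abs(advantages).mean(axis=1)
weights = np.exp(signal) / np.exp(signal).sum()

# Hybrid policy gradient: a weighted sum of per-component gradients,
# g = sum_i w_i * E[A_i * grad log pi], instead of one gradient from a
# single summed scalar reward.
per_component_grads = advantages @ score_grads / batch  # (components, params)
hybrid_grad = weights @ per_component_grads             # (params,)

print(weights.sum())      # weights form a convex combination (sums to 1)
print(hybrid_grad.shape)  # one update direction for the policy parameters
```

Keeping the per-component gradients separate before mixing is what distinguishes this from the summed-reward baseline, where the components are collapsed into one scalar before the critic ever sees them.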
Keywords
Deep reinforcement learning, robotics, bipedal locomotion, hybrid reward architecture