Reinforcement Learning from Bagged Reward: A Transformer-based Approach for Instance-Level Reward Redistribution
CoRR (2024)
Abstract
In reinforcement learning (RL), an instant reward signal is generated for
each action of the agent, such that the agent learns to maximize the cumulative
reward to obtain the optimal policy. However, in many real-world applications,
the instant reward signals are not obtainable by the agent. Instead, the
learner only obtains rewards at the ends of bags, where a bag is defined as a
partial sequence of a complete trajectory. In this situation, the learner
faces the significant difficulty of exploring the unknown instant rewards within
the bags, which cannot be addressed by existing approaches, including
trajectory-based approaches that consider only complete trajectories and ignore
the inner reward distributions. To formally study this situation, we introduce
a novel RL setting termed Reinforcement Learning from Bagged Rewards (RLBR),
where only the bagged rewards of sequences can be obtained. We provide a
theoretical study to establish the connection between RLBR and standard RL in
Markov Decision Processes (MDPs). To effectively explore the reward
distributions within the bagged rewards, we propose a Transformer-based reward
model, the Reward Bag Transformer (RBT), which uses the self-attention
mechanism for interpreting the contextual nuances and temporal dependencies
within each bag. Extensive experimental analyses demonstrate the superiority of
our method, particularly in its ability to mimic the original MDP's reward
distribution, highlighting its proficiency in contextual understanding and
adaptability to environmental dynamics.
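
To make the bagged-reward setting concrete, the following is a minimal sketch, not the authors' implementation. It assumes that bags are contiguous, non-overlapping, and cover the whole trajectory, and that a bag's reward is the sum of the hidden instant rewards inside it; the abstract does not spell out either assumption. The function names `bag_rewards` and `uniform_redistribution` are hypothetical. The uniform split is only a naive baseline that ignores context within a bag, which is the gap the Reward Bag Transformer is meant to address.

```python
from typing import List, Tuple

Bag = Tuple[int, int, float]  # (start index, end index exclusive, bagged reward)


def bag_rewards(instant_rewards: List[float], bag_lengths: List[int]) -> List[Bag]:
    """Collapse per-step rewards into one reward per bag.

    Assumption: the bagged reward is the sum of the instant rewards in the bag.
    """
    if sum(bag_lengths) != len(instant_rewards):
        raise ValueError("bag lengths must cover the trajectory exactly")
    bags: List[Bag] = []
    start = 0
    for length in bag_lengths:
        end = start + length
        bags.append((start, end, sum(instant_rewards[start:end])))
        start = end
    return bags


def uniform_redistribution(bags: List[Bag]) -> List[float]:
    """Naive baseline: spread each bagged reward evenly over its steps."""
    redistributed: List[float] = []
    for start, end, reward in bags:
        length = end - start
        redistributed.extend([reward / length] * length)
    return redistributed


# Hidden instant rewards (never shown to the learner in this setting).
instant = [1.0, 0.0, 2.0, 3.0, 0.0, 4.0]
bags = bag_rewards(instant, bag_lengths=[2, 3, 1])
redistributed = uniform_redistribution(bags)
```

Under these assumptions the learner sees only `bags`, never `instant`. The bagged rewards still sum to the trajectory return, but the per-step signal recovered by the uniform split generally differs from the hidden one.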