Actor-Mimic: Deep Multitask and Transfer Reinforcement Learning

Emilio Parisotto, Jimmy Lei Ba, Ruslan Salakhutdinov

International Conference on Learning Representations (ICLR), 2016.

Keywords:
deep policy network, Markov Decision Process, Actor-Mimic Network, state action, Deep Q-Network

Abstract:

The ability to act in multiple environments and transfer previous knowledge to new situations can be considered a critical aspect of any intelligent agent. Towards this goal, we define a novel method of multitask and transfer learning that enables an autonomous agent to learn how to behave in multiple tasks simultaneously, and then generalize its knowledge to new domains.

Introduction
  • Deep Reinforcement Learning (DRL), the combination of reinforcement learning methods and deep neural network function approximators, has recently shown considerable success in high-dimensional, challenging tasks such as robotic manipulation (Levine et al., 2015; Lillicrap et al., 2015) and arcade games (Mnih et al., 2015).
  • These methods exploit the ability of deep networks to learn salient descriptions of raw state input, allowing the agent designer to essentially bypass the lengthy process of feature engineering.
Highlights
  • Deep Reinforcement Learning (DRL), the combination of reinforcement learning methods and deep neural network function approximators, has recently shown considerable success in high-dimensional, challenging tasks such as robotic manipulation (Levine et al., 2015; Lillicrap et al., 2015) and arcade games (Mnih et al., 2015).
  • The Deep Q-Network (DQN) is trained using Q-learning combined with several techniques that stabilize training of the network, such as a replay memory that stores past transitions and a target network that defines a more consistent temporal-difference error (a standard form of this loss is sketched after this list).
  • The contribution of this paper is to develop and evaluate methods that enable multitask and transfer learning for Deep Reinforcement Learning agents, using the Arcade Learning Environment as a test environment
  • We show experimentally that this multitask pre-training can result in a Deep Q-Network that learns a target task significantly faster than a Deep Q-Network starting from a random initialization, effectively demonstrating that the source task representations generalize to the target task
  • An agent’s behaviour in a Markov Decision Process (MDP) is represented as a policy π(a|s), which defines the probability of executing action a in state s.
  • To first evaluate the actor-mimic objective on multitask learning (Section 5.1), we demonstrate the effectiveness of training an Actor-Mimic Network (AMN) over multiple games simultaneously.
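  • As a reference point for the Deep Q-Network bullet above, the temporal-difference loss that the replay memory and target network stabilize can be written (in its standard form from Mnih et al., 2015; stated here for clarity rather than quoted from the paper) as

        L(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \Big[ \big( r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \big)^{2} \Big],

    where \mathcal{D} is the replay memory of stored transitions, \gamma is the discount factor, and \theta^{-} are the parameters of the periodically updated target network.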
Methods
  • The authors validate the Actor-Mimic method by demonstrating its effectiveness at both multitask and transfer learning in the Arcade Learning Environment (ALE).
  • To first evaluate the actor-mimic objective on multitask learning (Section 5.1), the authors demonstrate the effectiveness of training an AMN over multiple games simultaneously (a minimal sketch of one training step follows this list).
  • In this particular case, since the focus is on multitask learning rather than transfer, only the policy-regression objective is used; the results are summarized in Table 1 (DQN vs. AMN mean and max rewards).
  • The authors can see that the AMN quickly reaches close-to-expert performance on 7 out of 8 games, taking only around 20 epochs, or 5 million training frames, to settle to stable behaviour.
  • This is in comparison to the expert networks, which were each trained for up to 50 million frames.
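  • A minimal sketch of one multitask training step in this spirit is given below. It is an illustration only, not the authors' code: the `amn`, `experts`, and `replays` objects, the per-game argument to `amn`, the replay buffer's `sample_batch` method, and the temperature value are assumptions made for the example.

        import random
        import torch
        import torch.nn.functional as F

        def actor_mimic_step(amn, experts, replays, optimizer, tau=1.0):
            # One multitask policy-regression update: pull the AMN's action
            # distribution towards the softened output of the expert DQN for a
            # randomly chosen source game.
            game = random.choice(list(experts.keys()))          # sample a source game
            states = replays[game].sample_batch()               # hypothetical replay buffer of stored states

            with torch.no_grad():
                q_expert = experts[game](states)                # expert Q-values, shape (batch, n_actions)
                target = F.softmax(q_expert / tau, dim=-1)      # soften into a guidance policy

            log_pi = F.log_softmax(amn(states, game), dim=-1)   # AMN action log-probabilities for this game
            loss = -(target * log_pi).sum(dim=-1).mean()        # cross-entropy to the expert policy

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            return loss.item()

    The cross-entropy against a temperature-softened expert output mirrors the distillation-style matching discussed in the Related work section.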
Conclusion
  • In this paper the authors defined Actor-Mimic, a novel method for training a single deep policy network over a set of related source tasks.
  • Using Actor-Mimic as a multitask pretraining phase can significantly improve learning speed in a set of target tasks.
  • This demonstrates that the features learnt over the source tasks can generalize to new target tasks, given a sufficient level of similarity between source and target tasks.
  • Using more targeted knowledge transfer could potentially help in the cases of negative transfer observed in the experiments.
Tables
  • Table 1: Actor-Mimic results on a set of eight Atari games. We compare the AMN performance to that of the expert DQNs trained separately on each game. The expert DQNs were trained until convergence and the AMN was trained for 100 training epochs, which is equivalent to 25 million input frames per source game. For the AMN, we report the maximum test reward ever achieved in epochs 1-100 and the mean test reward in epochs 91-100. For the DQN, we report the maximum test reward ever achieved until convergence and the mean test reward in the last 10 epochs of DQN training. Additionally, the last row of the table reports the percentage ratio of the AMN reward to the expert DQN reward for every game, for both mean and max rewards. These percentage ratios are plotted in Figure 6. The AMN results are averaged over 2 separately trained networks.
  • Table 2: Actor-Mimic transfer results for a set of 7 games. The 3 networks are trained as DQNs on the target task, with the only difference being the weight initialization. “Random” means random initial weights, “AMN-policy” means a weight initialization with an AMN trained using policy regression, and “AMN-feature” means a weight initialization with an AMN trained using both policy and feature regression (see text for more details). We report the average test reward every 4 training epochs (equivalent to 1 million training frames), where the average is over the 4 testing epochs that are evaluated immediately after each training epoch. For each game, we bold the network results that have the highest average testing reward for that particular column. A sketch of this weight-initialization transfer appears below.
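  • The transfer setup in Table 2 amounts to copying pretrained AMN weights into a fresh DQN before training it on the target game. A minimal sketch, assuming both networks are torch modules exposing named parameters with compatible trunk shapes (the names `dqn` and `amn` and the shape-matching rule are illustrative assumptions, not the authors' code):

        def init_dqn_from_amn(dqn, amn):
            # Copy every AMN parameter whose name and shape match the DQN's;
            # layers without a match (e.g. the target game's action output)
            # keep their random initialization and are learned from scratch.
            amn_params = dict(amn.named_parameters())
            for name, param in dqn.named_parameters():
                if name in amn_params and param.shape == amn_params[name].shape:
                    param.data.copy_(amn_params[name].data)
            return dqn

    After this initialization the DQN is trained on the target task in exactly the same way as the randomly initialized baseline, so differences in the reported rewards reflect the transferred representation.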
Related work
  • The idea of using expert networks to guide a single mimic network has been studied in the context of supervised learning, where it is known as model compression. The goal of model compression is to reduce the computational complexity of a large model (or ensemble of large models) to a single smaller mimic network while maintaining as high an accuracy as possible. To obtain high accuracy, the mimic network is trained using rich output targets provided by the experts. These output targets are either the final-layer logits (Ba & Caruana, 2014) or the high-temperature softmax outputs of the experts (Hinton et al., 2015). Our approach is most similar to the technique of Hinton et al. (2015), which matches the high-temperature outputs of the mimic network with those of the expert network. In addition, we also tried an objective that provides expert guidance at the feature level instead of only at the output level. A similar idea was also explored in the model compression setting (Romero et al., 2015), where a deep and thin mimic network used a larger expert network’s intermediate features as guiding hints during training. In contrast to these model compression techniques, our method is not concerned with decreasing test-time computation, but instead uses experts to provide otherwise unavailable supervision to a mimic network on several distinct tasks.
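  • For reference, the high-temperature softmax mentioned above (a standard definition, stated here for clarity rather than taken from the paper) maps a vector of logits z to the softened distribution

        \sigma_{\tau}(z)_i = \frac{\exp(z_i / \tau)}{\sum_j \exp(z_j / \tau)},

    and the mimic network is trained so that \sigma_{\tau} of its own outputs matches \sigma_{\tau} of the expert's outputs (for example via cross-entropy); a larger temperature \tau exposes more of the expert's relative preferences among non-maximal outputs.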
Funding
  • This work was supported by Samsung and NSERC.
References
  • Ba, Jimmy and Caruana, Rich. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems, pp. 2654–2662, 2014.
  • Banerjee, Bikramjit and Stone, Peter. General game learning using knowledge transfer. In International Joint Conference on Artificial Intelligence, pp. 672–677, 2007.
  • Bellemare, Marc G., Naddaf, Yavar, Veness, Joel, and Bowling, Michael. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
  • Bertsekas, Dimitri P. Dynamic Programming and Optimal Control, volume 1. Athena Scientific, Belmont, MA, 1995.
  • Guo, Xiaoxiao, Singh, Satinder, Lee, Honglak, Lewis, Richard L., and Wang, Xiaoshi. Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. In Advances in Neural Information Processing Systems 27, pp. 3338–3346, 2014.
  • Hinton, Geoffrey, Vinyals, Oriol, and Dean, Jeff. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • Kingma, Diederik P. and Ba, Jimmy. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
  • Konidaris, George and Barto, Andrew G. Autonomous shaping: Knowledge transfer in reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pp. 489–496, 2006.
  • Levine, Sergey and Koltun, Vladlen. Guided policy search. In Proceedings of the 30th International Conference on Machine Learning, 2013.
  • Levine, Sergey, Finn, Chelsea, Darrell, Trevor, and Abbeel, Pieter. End-to-end training of deep visuomotor policies. CoRR, abs/1504.00702, 2015.
  • Lillicrap, Timothy P., Hunt, Jonathan J., Pritzel, Alexander, Heess, Nicolas, Erez, Tom, Tassa, Yuval, Silver, David, and Wierstra, Daan. Continuous control with deep reinforcement learning. CoRR, abs/1509.02971, 2015.
  • Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A., Veness, Joel, Bellemare, Marc G., Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K., Ostrovski, Georg, Petersen, Stig, Beattie, Charles, Sadik, Amir, Antonoglou, Ioannis, King, Helen, Kumaran, Dharshan, Wierstra, Daan, Legg, Shane, and Hassabis, Demis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Perkins, Theodore J. and Precup, Doina. A convergent form of approximate policy iteration. In Advances in Neural Information Processing Systems, pp. 1595–1602, 2002.
  • Robbins, Herbert and Monro, Sutton. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400–407, 1951.
  • Romero, Adriana, Ballas, Nicolas, Kahou, Samira Ebrahimi, Chassang, Antoine, Gatta, Carlo, and Bengio, Yoshua. FitNets: Hints for thin deep nets. In International Conference on Learning Representations, 2015.
  • Ross, Stephane, Gordon, Geoffrey, and Bagnell, Andrew. A reduction of imitation learning and structured prediction to no-regret online learning. Journal of Machine Learning Research, 15:627–635, 2011.
  • Seneta, E. Sensitivity analysis, ergodicity coefficients, and rank-one updates for finite Markov chains. Numerical Solution of Markov Chains, 8:121–129, 1991.
  • Sutton, Richard S. and Barto, Andrew G. Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998.
  • Taylor, Matthew E. and Stone, Peter. Transfer learning for reinforcement learning domains: A survey. The Journal of Machine Learning Research, 10:1633–1685, 2009.