# Evolution Strategies as a Scalable Alternative to Reinforcement Learning

arXiv preprint arXiv:1703.03864, 2017.

Keywords:

optimization algorithm, natural evolution strategies, Markov decision process, Atari game, value function

Abstract:

We explore the use of Evolution Strategies (ES), a class of black-box optimization algorithms, as an alternative to popular MDP-based RL techniques such as Q-learning and Policy Gradients. Experiments on MuJoCo and Atari show that ES is a viable solution strategy that scales extremely well with the number of CPUs available, by using a novel communication strategy based on common random numbers […]


Introduction

- Developing agents that can accomplish challenging tasks in complex, uncertain environments is a key goal of artificial intelligence.
- The authors found that the use of virtual batch normalization [Salimans et al., 2016] and other reparameterizations of the neural network policy greatly improves the reliability of evolution strategies
- Without these methods ES proved brittle in the experiments, but with these reparameterizations the authors achieved strong results over a wide variety of environments

Highlights

- Developing agents that can accomplish challenging tasks in complex, uncertain environments is a key goal of artificial intelligence
- We found the evolution strategies method to be highly parallelizable: by introducing a novel communication strategy based on common random numbers, we are able to achieve linear speedups in run time even when using over a thousand workers
- We focus on reinforcement learning problems, so F (·) will be the stochastic return provided by an environment, and θ will be the parameters of a deterministic or stochastic policy πθ describing an agent acting in that environment, controlled by either discrete or continuous actions
- For the Atari environments, we found that Gaussian parameter perturbations on DeepMind's convolutional architectures [Mnih et al., 2015] did not always lead to adequate exploration: for some environments, randomly perturbed parameters tended to encode policies that always took one specific action regardless of the state that was given as input
- We have explored Evolution Strategies, a class of black-box optimization algorithms, as an alternative to popular Markov Decision Process-based reinforcement learning techniques such as Q-learning and policy gradients
- A proof of concept for meta-learning in a reinforcement learning setting was given by Duan et al. [2016b]: using black-box optimization we hope to be able to extend these results
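
The gradient estimator behind these highlights can be sketched in a few lines: perturb the policy parameters with Gaussian noise, evaluate the resulting returns, and step along the noise directions weighted by normalized returns. This is a minimal illustrative sketch, not the paper's implementation: the quadratic objective, the `es_step` name, and all hyperparameter values are stand-ins, and the simple return normalization here replaces the rank transformation the authors use.

```python
import numpy as np

def es_step(theta, f, sigma=0.1, alpha=0.02, n=100, rng=None):
    """One ES update: estimate the gradient of E[f(theta + sigma * eps)]
    over Gaussian perturbations eps and take an ascent step along it."""
    if rng is None:
        rng = np.random.default_rng(0)
    eps = rng.standard_normal((n, theta.size))      # perturbation directions
    returns = np.array([f(theta + sigma * e) for e in eps])
    # Normalize returns (the paper instead applies a rank transformation).
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    grad = eps.T @ returns / (n * sigma)            # score-function estimator
    return theta + alpha * grad

# Toy stand-in for the stochastic return F: maximized at theta = 3 everywhere.
f = lambda th: -np.sum((th - 3.0) ** 2)
theta = np.zeros(5)
rng = np.random.default_rng(0)
for _ in range(500):
    theta = es_step(theta, f, rng=rng)
```

Because only function evaluations of F are needed, the same loop applies unchanged whether F is a smooth test function or the return of a full episode rollout of a policy.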

Methods

- 4.1 MuJoCo: The authors evaluated ES on a benchmark of continuous robotic control problems in the OpenAI Gym [Brockman et al., 2016] against a highly tuned implementation of Trust Region Policy Optimization [Schulman et al., 2015], a policy gradient algorithm designed to efficiently optimize neural network policies.
- The authors found that ES was able to reach TRPO's final performance on these tasks after 5 million timesteps of environment interaction
- To obtain this result, the authors ran ES over 6 random seeds and compared the mean learning curves to similarly computed curves for TRPO.
- The authors achieved up to 3x better sample complexity than TRPO
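
The common-random-numbers communication strategy highlighted above can be illustrated with a toy single-process simulation: all workers hold the same seed table, so after each round only scalar returns need to be exchanged, and every worker rebuilds all perturbations locally from the seeds. The objective, the per-round seed scheme, and the hyperparameters below are hypothetical stand-ins, not the paper's distributed implementation.

```python
import numpy as np

SEEDS = list(range(8))              # one seed per worker, known to all workers
sigma, alpha, dim = 0.1, 0.02, 4
f = lambda th: -np.sum(th ** 2)     # toy stand-in for an episode's return

def perturbation(seed, size):
    # Deterministic: any worker can rebuild this noise from the seed alone.
    return np.random.default_rng(seed).standard_normal(size)

theta = np.ones(dim)
for step in range(200):
    round_seeds = [s + step * len(SEEDS) for s in SEEDS]
    # Each "worker" evaluates only its own perturbed policy; these scalars are
    # the only values that would cross the network in a distributed setting.
    returns = np.array([f(theta + sigma * perturbation(s, dim)) for s in round_seeds])
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    # Every worker reconstructs everyone's perturbation locally from the seeds.
    eps = np.stack([perturbation(s, dim) for s in round_seeds])
    theta = theta + alpha / (len(SEEDS) * sigma) * (eps.T @ returns)
```

Exchanging one scalar per worker per round is what makes near-linear speedups plausible: communication cost no longer grows with the number of policy parameters.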

Conclusion

- The authors have explored Evolution Strategies, a class of black-box optimization algorithms, as an alternative to popular MDP-based RL techniques such as Q-learning and policy gradients.
- The authors plan to apply evolution strategies to those problems for which MDP-based reinforcement learning is less well-suited: problems with long time horizons and complicated reward structure.
- A proof of concept for meta-learning in an RL setting was given by Duan et al. [2016b]: using black-box optimization the authors hope to be able to extend these results.

- Table1: MuJoCo tasks: Ratio of ES timesteps to TRPO timesteps needed to reach various percentages of TRPO’s learning progress at 5 million timesteps
- Table2: Final results obtained using Evolution Strategies on Atari 2600 games (feedforward CNN policy, deterministic policy evaluation, averaged over 10 re-runs with up to 30 random initial no-ops), compared to results for DQN and A3C from Mnih et al. [2016] and HyperNEAT from Hausknecht et al. [2014]. A2C is our synchronous variant of A3C, and its reported scores are obtained with 320M training frames with the same evaluation setup as for the ES results. All methods were trained on raw pixel input

Related work

- There have been many attempts at applying methods related to ES to train neural networks [Risi and Togelius, 2015]. For Atari, Hausknecht et al. [2014] obtain impressive results. Sehnke et al. [2010] proposed a method closely related to the one investigated in our work. Koutník et al. [2013, 2010] and Srivastava et al. [2012] have similarly applied ES methods to RL problems with visual inputs, but where the policy was compressed in a number of different ways. Natural evolution strategies have been successfully applied to black-box optimization [Wierstra et al., 2008, 2014], as well as to training the recurrent weights in recurrent neural networks [Schmidhuber et al., 2007]. Stulp and Sigaud [2012] explored similar approaches to black-box optimization. An interesting hybrid of black-box optimization and policy gradient methods was recently explored by Usunier et al. [2016]. HyperNEAT [Stanley et al., 2009] is an alternative approach that evolves both the weights of neural networks and their structure. Derivative-free optimization methods have also been analyzed in the convex setting [Duchi et al., 2015, Nesterov, 2012]. The main contribution of our work is in showing that this class of algorithms is extremely scalable and efficient to use on distributed hardware. We have shown that ES, when carefully implemented, is competitive with competing RL algorithms in terms of performance on the hardest problems solvable today, is surprisingly close in terms of data efficiency, and takes less wallclock time to train.

References

- Alex Braylan, Mark Hollenbeck, Elliot Meyerson, and Risto Miikkulainen. Frame skip is a powerful parameter for learning to play atari. Space, 1600:1800, 2005.
- Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
- Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In Proceedings of the 33rd International Conference on Machine Learning (ICML), 2016a.
- Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. RL2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016b.
- John C Duchi, Michael I Jordan, Martin J Wainwright, and Andre Wibisono. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61(5):2788–2806, 2015.
- John Geweke. Antithetic acceleration of monte carlo integration in bayesian inference. Journal of Econometrics, 38(1-2):73–89, 1988.
- Tobias Glasmachers, Tom Schaul, and Jürgen Schmidhuber. A natural evolution strategy for multi-objective optimization. In Parallel Problem Solving from Nature. Springer, 2010a.
- Tobias Glasmachers, Tom Schaul, Sun Yi, Daan Wierstra, and Jürgen Schmidhuber. Exponential natural evolution strategies. In Proceedings of the 12th annual conference on Genetic and evolutionary computation, pages 393–400. ACM, 2010b.
- Nikolaus Hansen and Andreas Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary computation, 9(2):159–195, 2001.
- Matthew Hausknecht, Joel Lehman, Risto Miikkulainen, and Peter Stone. A neuroevolution approach to general atari game playing. IEEE Transactions on Computational Intelligence and AI in Games, 6(4):355–366, 2014.
- Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
- Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. arXiv preprint arXiv:1704.04760, 2017.
- Kenji Kawaguchi. Deep learning without poor local minima. In Advances In Neural Information Processing Systems, pages 586–594, 2016.
- Jan Koutník, Faustino Gomez, and Jürgen Schmidhuber. Evolving neural networks in compressed weight space. In Proceedings of the 12th annual conference on Genetic and evolutionary computation, pages 619–626. ACM, 2010.
- Jan Koutník, Giuseppe Cuccu, Jürgen Schmidhuber, and Faustino Gomez. Evolving large-scale neural networks for vision-based reinforcement learning. In Proceedings of the 15th annual conference on Genetic and evolutionary computation, pages 1061–1068. ACM, 2013.
- Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
- Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy P Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 2016.
- Yurii Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
- Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, pages 1–40, 2011.
- Andrew Ng, Adam Coates, Mark Diel, Varun Ganapathi, Jamie Schulte, Ben Tse, Eric Berger, and Eric Liang. Autonomous inverted helicopter flight via reinforcement learning. Experimental Robotics IX, pages 363–372, 2006.
- Ronald Parr and Stuart Russell. Reinforcement learning with hierarchies of machines. Advances in neural information processing systems, pages 1043–1049, 1998.
- I. Rechenberg and M. Eigen. Evolutionsstrategie: Optimierung Technischer Systeme nach Prinzipien der Biologischen Evolution. Frommann-Holzboog Stuttgart, 1973.
- Sebastian Risi and Julian Togelius. Neuroevolution in games: State of the art and open challenges. IEEE Transactions on Computational Intelligence and AI in Games, 2015.
- Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pages 2226–2234, 2016.
- Tom Schaul, Tobias Glasmachers, and Jürgen Schmidhuber. High dimensions and heavy tails for natural evolution strategies. In Proceedings of the 13th annual conference on Genetic and evolutionary computation, pages 845–852. ACM, 2011.
- Juergen Schmidhuber and Jieyu Zhao. Direct policy search and uncertain policy evaluation. In Aaai spring symposium on search under uncertain and incomplete information, stanford univ, pages 119–124, 1998.
- Jürgen Schmidhuber, Daan Wierstra, Matteo Gagliolo, and Faustino Gomez. Training recurrent networks by evolino. Neural computation, 19(3):757–779, 2007.
- John Schulman, Sergey Levine, Pieter Abbeel, Michael I Jordan, and Philipp Moritz. Trust region policy optimization. In ICML, pages 1889–1897, 2015.
- H.-P. Schwefel. Numerische optimierung von computer-modellen mittels der evolutionsstrategie. 1977.
- Frank Sehnke, Christian Osendorfer, Thomas Rückstieß, Alex Graves, Jan Peters, and Jürgen Schmidhuber. Parameter-exploring policy gradients. Neural Networks, 23(4):551–559, 2010.
- David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
- James C Spall. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE transactions on automatic control, 37(3):332–341, 1992.
- Kenneth O Stanley, David B D’Ambrosio, and Jason Gauci. A hypercube-based encoding for evolving large-scale neural networks. Artificial life, 15(2):185–212, 2009.
- Freek Stulp and Olivier Sigaud. Policy improvement methods: Between black-box optimization and episodic reinforcement learning. 2012.
- Yi Sun, Daan Wierstra, Tom Schaul, and Juergen Schmidhuber. Efficient natural evolution strategies. In Proceedings of the 11th Annual conference on Genetic and evolutionary computation, pages 539–546. ACM, 2009.
- Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 5026–5033. IEEE, 2012.
- Nicolas Usunier, Gabriel Synnaeve, Zeming Lin, and Soumith Chintala. Episodic exploration for deep deterministic policies: An application to starcraft micromanagement tasks. arXiv preprint arXiv:1609.02993, 2016.
- Sjoerd van Steenkiste, Jan Koutník, Kurt Driessens, and Jürgen Schmidhuber. A wavelet-based encoding for neuroevolution. In Proceedings of the 2016 on Genetic and Evolutionary Computation Conference, pages 517–524. ACM, 2016.
- Daan Wierstra, Tom Schaul, Jan Peters, and Juergen Schmidhuber. Natural evolution strategies. In Evolutionary Computation, 2008. CEC 2008.(IEEE World Congress on Computational Intelligence). IEEE Congress on, pages 3381–3387. IEEE, 2008.
- Daan Wierstra, Tom Schaul, Tobias Glasmachers, Yi Sun, Jan Peters, and Jürgen Schmidhuber. Natural evolution strategies. Journal of Machine Learning Research, 15(1):949–980, 2014.
- Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
- Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C Courville, Ruslan Salakhutdinov, Richard S Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, volume 14, pages 77–81, 2015.
- Sun Yi, Daan Wierstra, Tom Schaul, and Jürgen Schmidhuber. Stochastic search using the natural gradient. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1161–1168. ACM, 2009.
