DARTS: Differentiable Architecture Search

ICLR 2019 (arXiv:1806.09055).

Keywords:
architecture search algorithm, non-differentiable, efficient architecture search, image classification, language modeling
In brief: We present DARTS, a simple yet efficient architecture search algorithm for both convolutional and recurrent networks.

Abstract:

This paper addresses the scalability challenge of architecture search by formulating the task in a differentiable manner. Unlike conventional approaches of applying evolution or reinforcement learning over a discrete and non-differentiable search space, our method is based on the continuous relaxation of the architecture representation, ...
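
The core of the method is to relax the categorical choice of operation on each edge of a cell into a softmax over all candidate operations, so that the architecture is described by continuous variables α and can be optimized by gradient descent. The sketch below illustrates such a mixed operation in PyTorch-style code; the candidate operation list and class names are illustrative and are not the authors' released implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MixedOp(nn.Module):
        """Continuous relaxation of one edge: a softmax-weighted sum of candidate ops."""
        def __init__(self, channels):
            super().__init__()
            # Illustrative candidate set; the paper uses 7 non-zero operations plus a 'zero' op.
            self.ops = nn.ModuleList([
                nn.Identity(),                                            # skip connection
                nn.Conv2d(channels, channels, 3, padding=1, bias=False),  # stand-in for separable conv
                nn.MaxPool2d(3, stride=1, padding=1),                     # 3x3 max pooling
                nn.AvgPool2d(3, stride=1, padding=1),                     # 3x3 average pooling
            ])

        def forward(self, x, alpha_edge):
            # alpha_edge holds the (unnormalized) architecture parameters for this edge, one per op.
            weights = F.softmax(alpha_edge, dim=-1)
            return sum(w * op(x) for w, op in zip(weights, self.ops))

    # Architecture parameters are plain tensors, optimized by gradient descent alongside the weights.
    alpha = nn.Parameter(1e-3 * torch.randn(4))
    edge = MixedOp(channels=16)
    y = edge(torch.randn(2, 16, 32, 32), alpha)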

Introduction
  • Discovering state-of-the-art neural network architectures requires substantial effort of human experts.
  • The automatically searched architectures have achieved highly competitive performance in tasks such as image classification (Zoph & Le, 2017; Zoph et al., 2018; Liu et al., 2018b;a; Real et al., 2018) and object detection (Zoph et al., 2018).
  • Obtaining a state-of-the-art architecture for CIFAR-10 and ImageNet required 2000 GPU days of reinforcement learning (RL) (Zoph et al., 2018) or 3150 GPU days of evolution (Real et al., 2018).
  • An inherent cause of inefficiency in the dominant approaches, e.g. those based on RL, evolution, MCTS (Negrinho & Gordon, 2017), SMBO (Liu et al., 2018a) or Bayesian optimization (Kandasamy et al., 2018), is that architecture search is treated as a black-box optimization problem over a discrete domain, which requires a large number of architecture evaluations.
Highlights
  • Discovering state-of-the-art neural network architectures requires substantial effort of human experts
  • Through extensive experiments on image classification and language modeling tasks we show that gradient-based architecture search achieves highly competitive results on CIFAR-10 and outperforms the state of the art on Penn Treebank
  • We show that the architectures learned by DARTS on CIFAR-10 and Penn Treebank are transferable to ImageNet and WikiText-2, respectively
  • Our experiments on CIFAR-10 and Penn Treebank consist of two stages, architecture search (Sect. 3.1) and architecture evaluation (Sect. 3.2)
  • We investigate the transferability of the best cells learned on CIFAR-10 and Penn Treebank by evaluating them on ImageNet and WikiText-2 (WT2) respectively
  • We presented DARTS, a simple yet efficient architecture search algorithm for both convolutional and recurrent networks
Methods
  • Baselines compared against DARTS on CIFAR-10 (Table 1), with their search methods: DenseNet-BC (Huang et al., 2017; manual design), NASNet-A + cutout (Zoph et al., 2018; RL), BlockQNN (Zhong et al., 2018; RL), AmoebaNet-A and AmoebaNet-A + cutout (Real et al., 2018; evolution), PNAS (Liu et al., 2018a; SMBO), and ENAS + cutout (Pham et al., 2018b; RL). Entries marked † were retrained using the authors' setup; entries marked * were obtained by rerunning the publicly released code (see the footnotes under Study subjects and analysis).
Results
  • The authors' experiments on CIFAR-10 and PTB consist of two stages, architecture search (Sect. 3.1) and architecture evaluation (Sect. 3.2).
  • In an ablation that forgoes bilevel optimization (optimizing α and w on the same data, without a held-out validation set for α), the resulting best convolutional cell yielded 4.16 ± 0.16% test error using 3.1M parameters, which is worse than random search.
  • A second such baseline yielded a best cell with 3.56 ± 0.10% test error using 3.0M parameters.
  • The authors hypothesize that these heuristics cause α to overfit the training data, leading to poor generalization; the bilevel alternative, which updates α on a held-out validation set, is sketched after this list.
  • Table 3 shows that the cell learned on CIFAR-10 is transferable to ImageNet. Notably, DARTS achieves performance competitive with the state-of-the-art RL method (Zoph et al., 2018) while using three orders of magnitude fewer computational resources.
  • The issue of transferability could potentially be circumvented by directly optimizing the architecture on the task of interest.
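  • A minimal sketch of the alternating update that implements the bilevel idea in its first-order approximation: the architecture parameters α are updated on a batch from the held-out validation split, then the network weights w are updated on a batch from the training split with α fixed. The function signature and the assumption that the model takes α as an input are illustrative; the paper additionally derives a second-order variant of the architecture gradient.

    import torch

    def search_step(model, alpha, w_optimizer, alpha_optimizer,
                    train_batch, val_batch, loss_fn):
        """One DARTS-style alternating step (first-order approximation)."""
        # 1) Update the architecture parameters alpha on a validation batch.
        x_val, y_val = val_batch
        alpha_optimizer.zero_grad()
        loss_fn(model(x_val, alpha), y_val).backward()
        alpha_optimizer.step()

        # 2) Update the network weights w on a training batch, keeping alpha fixed.
        x_train, y_train = train_batch
        w_optimizer.zero_grad()
        loss_fn(model(x_train, alpha), y_train).backward()
        w_optimizer.step()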
Conclusion
  • The authors presented DARTS, a simple yet efficient architecture search algorithm for both convolutional and recurrent networks.
  • The current method may suffer from discrepancies between the continuous architecture encoding and the derived discrete architecture.
  • This could be alleviated, e.g., by annealing the softmax temperature to enforce one-hot selection (a sketch of this idea follows the list).
  • It would also be interesting to investigate performance-aware architecture derivation schemes based on the shared parameters learned during the search process.
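  • A sketch of the temperature-annealing remedy mentioned above: dividing α by a temperature that decays towards zero makes the softmax over candidate operations increasingly one-hot, which shrinks the gap between the continuous encoding and the derived discrete architecture. This is a possible extension suggested in the conclusion, not part of the published algorithm.

    import torch
    import torch.nn.functional as F

    def edge_weights(alpha_edge, temperature):
        # As temperature -> 0 the softmax approaches a one-hot selection (argmax over ops).
        return F.softmax(alpha_edge / temperature, dim=-1)

    alpha_edge = torch.tensor([0.2, 1.5, 0.9, -0.3])
    for t in (5.0, 1.0, 0.1):  # e.g., decay the temperature over the course of the search
        print(t, edge_weights(alpha_edge, t))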
Tables
  • Table1: Comparison with state-of-the-art image classifiers on CIFAR-10 (lower error rate is better). Note the search cost for DARTS does not include the selection cost (1 GPU day) or the final evaluation cost by training the selected architecture from scratch (1.5 GPU days)
  • Table2: Comparison with state-of-the-art language models on PTB (lower perplexity is better). Note the search cost for DARTS does not include the selection cost (1 GPU day) or the final evaluation cost by training the selected architecture from scratch (3 GPU days)
  • Table3: Comparison with state-of-the-art image classifiers on ImageNet in the mobile setting
  • Table4: Comparison with state-of-the-art language models on WT2
Contributions
  • Addresses the scalability challenge of architecture search by formulating the task in a differentiable manner
  • Shows that DARTS is able to design a convolutional cell that achieves 2.76 ± 0.09% test error on CIFAR-10 for image classification using 3.3M parameters, which is competitive with the state-of-the-art result by regularized evolution obtained using three orders of magnitude more computation resources
  • Introduces a novel algorithm for differentiable network architecture search based on bilevel optimization (formulated after this list), which is applicable to both convolutional and recurrent architectures
  • Shows that gradient-based architecture search achieves highly competitive results on CIFAR-10 and outperforms the state of the art on PTB
  • Shows that the architectures learned by DARTS on CIFAR-10 and PTB are transferable to ImageNet and WikiText-2, respectively
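  • The bilevel problem referenced above, as formulated in the paper, treats α as the upper-level variable and the network weights w as the lower-level variable; in practice the inner problem is approximated by a single training step, giving the approximate architecture gradient on the second line:

    \min_{\alpha} \ \mathcal{L}_{\mathrm{val}}\big(w^{*}(\alpha), \alpha\big)
        \quad \text{s.t.} \quad
        w^{*}(\alpha) = \operatorname*{arg\,min}_{w} \ \mathcal{L}_{\mathrm{train}}(w, \alpha)

    \nabla_{\alpha} \mathcal{L}_{\mathrm{val}}\big(w^{*}(\alpha), \alpha\big)
        \approx \nabla_{\alpha} \mathcal{L}_{\mathrm{val}}\big(w - \xi \, \nabla_{w} \mathcal{L}_{\mathrm{train}}(w, \alpha), \ \alpha\big)

    Here \xi is the learning rate of the single virtual gradient step used to approximate w^{*}(\alpha); setting \xi = 0 recovers the first-order approximation.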
Study subjects and analysis
  • Table 1 footnotes (CIFAR-10): * Obtained by repeating ENAS 8 times using the code publicly released by the authors; the cell for final evaluation is chosen according to the same selection protocol as for DARTS. † Obtained by training the corresponding architectures using our setup. ‡ Best architecture among 24 samples according to the validation error after 100 training epochs.
  • Table 2 footnotes (PTB): * Obtained using the code (Pham et al., 2018a) publicly released by the authors. † Obtained by training the corresponding architecture using our setup. ‡ Best architecture among 8 samples according to the validation perplexity after 300 training epochs.

Reference
  • Karim Ahmed and Lorenzo Torresani. Connectivity learning in multi-branch networks. arXiv preprint arXiv:1709.09582, 2017.
  • G Anandalingam and TL Friesz. Hierarchical optimization: An introduction. Annals of Operations Research, 34(1):1–11, 1992.
  • Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures using reinforcement learning. ICLR, 2017.
  • Bowen Baker, Otkrist Gupta, Ramesh Raskar, and Nikhil Naik. Accelerating neural architecture search using performance prediction. ICLR Workshop, 2018.
  • Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, and Quoc Le. Understanding and simplifying one-shot architecture search. In International Conference on Machine Learning, pp. 549–558, 2018.
  • Andrew Brock, Theodore Lim, James M Ritchie, and Nick Weston. Smash: one-shot model architecture search through hypernetworks. ICLR, 2018.
  • Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, and Jun Wang. Efficient architecture search by network transformation. AAAI, 2018.
  • Benoît Colson, Patrice Marcotte, and Gilles Savard. An overview of bilevel optimization. Annals of operations research, 153(1):235–256, 2007.
  • Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
  • Thomas Elsken, Jan-Hendrik Metzen, and Frank Hutter. Simple and efficient architecture search for convolutional neural networks. arXiv preprint arXiv:1711.04528, 2017.
  • Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, pp. 1126–1135, 2017.
  • Luca Franceschi, Paolo Frasconi, Saverio Salzo, and Massimilano Pontil. Bilevel programming for hyperparameter optimization and meta-learning. ICML, 2018.
  • Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In NIPS, pp. 1019–1027, 2016.
  • Edouard Grave, Armand Joulin, and Nicolas Usunier. Improving neural language models with a continuous cache. arXiv preprint arXiv:1612.04426, 2016.
  • Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
  • Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. In CVPR, 2017.
  • Hakan Inan, Khashayar Khosravi, and Richard Socher. Tying word vectors and word classifiers: A loss framework for language modeling. ICLR, 2017.
  • Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
  • Kirthevasan Kandasamy, Willie Neiswanger, Jeff Schneider, Barnabas Poczos, and Eric Xing. Neural architecture search with bayesian optimisation and optimal transport. NIPS, 2018.
  • Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. Dynamic evaluation of neural sequence models. arXiv preprint arXiv:1709.07432, 2017.
  • Chenxi Liu, Barret Zoph, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. ECCV, 2018a.
  • Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, and Koray Kavukcuoglu. Hierarchical representations for efficient architecture search. ICLR, 2018b.
  • Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
  • Jelena Luketina, Mathias Berglund, Klaus Greff, and Tapani Raiko. Scalable gradient-based tuning of continuous regularization hyperparameters. In ICML, pp. 2952–2960, 2016.
  • Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter optimization through reversible learning. In ICML, pp. 2113–2122, 2015.
  • Gábor Melis, Chris Dyer, and Phil Blunsom. On the state of the art of evaluation in neural language models. ICLR, 2018.
  • Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing lstm language models. ICLR, 2018.
  • Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial networks. ICLR, 2017.
  • Renato Negrinho and Geoff Gordon. Deeparchitect: Automatically designing and training deep architectures. arXiv preprint arXiv:1704.08792, 2017.
  • Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.
Experimental setup
  • A large network of 20 cells is trained for 600 epochs with batch size 96. The initial number of channels is increased from 16 to 36 to ensure the model size is comparable with other baselines in the literature (around 3M parameters). Other hyperparameters remain the same as the ones used for architecture search. Following existing works (Pham et al., 2018b; Zoph et al., 2018; Liu et al., 2018a; Real et al., 2018), additional enhancements include cutout (DeVries & Taylor, 2017), path dropout of probability 0.2 and auxiliary towers with weight 0.4. The training takes 1.5 days on a single GPU with our implementation in PyTorch (Paszke et al., 2017). Since the CIFAR results are subject to high variance even with exactly the same setup (Liu et al., 2018b), we report the mean and standard deviation of 10 independent runs for our full model.
  • To avoid any discrepancy between different implementations or training settings (e.g. the batch sizes), we incorporated the NASNet-A cell (Zoph et al., 2018) and the AmoebaNet-A cell (Real et al., 2018) into our training framework and reported their results under the same settings as our cells.
  • A network of 14 cells is trained for 250 epochs with batch size 128, weight decay 3 × 10^-5 and initial SGD learning rate 0.1 (decayed by a factor of 0.97 after each epoch). Other hyperparameters follow Zoph et al. (2018), Real et al. (2018) and Liu et al. (2018a). The training takes 12 days on a single GPU.
  • Each cell corresponds to approximately 10^9 possible DAGs without considering graph isomorphism (recall there are 7 non-zero ops, 2 input nodes, and 4 intermediate nodes with 2 predecessors each). Since normal and reduction cells are learned jointly, the total number of architectures is approximately (10^9)^2 = 10^18, which is greater than the 5.6 × 10^14 of PNAS (Liu et al., 2018a), which learns only a single type of cell (a worked count is sketched below).
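  • One way to reproduce the roughly 10^9 figure under the stated assumptions, counting for each of the 4 intermediate nodes the choice of an unordered pair of distinct predecessors and one of the 7 non-zero operations per incoming edge. This is a reconstruction of the counting, not code from the paper.

    from math import comb

    NUM_OPS = 7                  # non-zero candidate operations per edge
    dags_per_cell = 1
    for i in range(4):           # intermediate nodes, added one at a time
        num_predecessors = 2 + i                                   # 2 input nodes + earlier intermediates
        dags_per_cell *= comb(num_predecessors, 2) * NUM_OPS ** 2  # pick 2 predecessors, one op each

    print(f"DAGs per cell:      {dags_per_cell:.2e}")        # ~1.0e+09
    print(f"normal x reduction: {dags_per_cell ** 2:.2e}")   # ~1.1e+18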