Sequence Level Training with Recurrent Neural Networks

International Conference on Learning Representations, 2015.

Keywords:
Mixed Incremental Cross-Entropy Reinforce; sequence level training; natural language; Convolutional Neural Network; entire sequence
Weibo:
We propose the Mixed Incremental Cross-Entropy Reinforce algorithm, which deals with these issues and enables successful training of reinforcement learning models for text generation

Abstract:

Many natural language processing applications use language models to generate text. These models are typically trained to predict the next word in a sequence, given the previous words and some context such as an image. However, at test time the model is expected to generate the entire sequence from scratch. This discrepancy makes generation brittle, as errors may accumulate along the way.

Introduction
  • Natural language is the most natural form of communication for humans. It is essential that interactive AI systems are capable of generating text (Reiter & Dale, 2000).
  • Popular choices for text generation models are language models based on n-grams (Kneser & Ney, 1995), feed-forward neural networks (Morin & Bengio, 2005), and recurrent neural networks (RNNs; Mikolov et al., 2010).
  • When used as-is to generate text, these models suffer from two major drawbacks.
  • Prior attempts at optimizing test metrics (McAllester et al., 2010; He & Deng, 2012) were restricted to linear models or required a large number of samples to work well (Auli & Gao, 2014)
Highlights
  • Natural language is the most natural form of communication for humans
  • Data as Demonstrator is usually better than XENT, but not as good as Mixed Incremental Cross-Entropy Reinforce
  • Our work is motivated by two major deficiencies in training the current generative models for text generation: exposure bias and a loss which does not operate at the sequence level
  • We propose the Mixed Incremental Cross-Entropy Reinforce algorithm, which deals with these issues and enables successful training of reinforcement learning models for text generation
  • Our results show that Mixed Incremental Cross-Entropy Reinforce outperforms three strong baselines for greedy generation and it is very competitive with beam search
  • Our training algorithm relies on a single sample; it would be interesting to investigate the effect of more comprehensive search methods at training time
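The highlights above describe MIXER's incremental schedule in prose; a minimal sketch of that schedule follows. The function name and parameters (`delta_every`, `delta_step`) are our own illustration — the paper selects the actual schedule by hyper-parameter search (cf. Table 2):

```python
def mixer_schedule(T, epoch, delta_every=2, delta_step=2):
    """Split a length-T target sequence for MIXER-style training:
    the first (T - delta) tokens are trained with cross-entropy
    (XENT) and the last delta tokens with REINFORCE, where delta
    grows by `delta_step` every `delta_every` epochs, gradually
    exposing the model to its own predictions."""
    delta = min(T, (epoch // delta_every) * delta_step)
    return T - delta, delta
```

Early in training the whole sequence is trained with cross-entropy (a trained policy replacing a random initial one); eventually every position is trained with REINFORCE.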
Methods
  • The authors train conditional RNNs by unfolding them up to a certain maximum length
  • The authors chose this length to cover about 95% of the target sentences in the data sets the authors consider.
  • Sample generations from the summarization task include:
  • MIXER: china official dismisses reports of #g mobile licenses
  • greece risks bankruptcy if it does not take radical extra measures to fix its finances , prime minister george papandreou warned on tuesday , saying the country was in a ‘‘ wartime situation ’’
  • greece threatens to measures to finances greece does not take radical measures to deficit
  • the indonesian police were close to identify the body parts resulted from the deadly explosion in front of the australian embassy by the dna test , police chief general said on wednesday
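The maximum unfolding length mentioned above (chosen to cover about 95% of target sentences) can be sketched as follows; the function and its interface are our own illustration, not from the paper:

```python
import math

def max_unfold_length(lengths, coverage=0.95):
    """Smallest length L such that at least `coverage` of the
    target sentences have length <= L; the RNN is then unfolded
    up to L steps and longer targets are truncated."""
    s = sorted(lengths)
    idx = math.ceil(coverage * len(s)) - 1
    return s[idx]
```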
Results
  • In order to validate MIXER, the authors compute BLEU on the machine translation and image captioning tasks, and ROUGE on the summarization task.
  • The scores on the test set are reported in Figure 5.
  • The authors observe that MIXER produces the best generations and improves generation over XENT by 1 to 3 points across all the tasks.
  • Training at the sequence level and directly optimizing the test-time score yields better generations than turning a sequence of discrete decisions into a differentiable process amenable to standard back-propagation.
  • DAD is usually better than XENT, but not as good as MIXER
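As a rough illustration of the sequence-level metric being evaluated, here is a simplified single-reference BLEU (geometric mean of modified n-gram precisions with a brevity penalty). This is our sketch, not the official smoothed implementation used for reported scores:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(reference, hypothesis, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of modified
    n-gram precisions (clipped counts) times a brevity penalty.
    Single reference, no smoothing -- an illustration only."""
    precisions = []
    for n in range(1, max_n + 1):
        hyp = Counter(ngrams(hypothesis, n))
        ref = Counter(ngrams(reference, n))
        overlap = sum(min(c, ref[g]) for g, c in hyp.items())
        precisions.append(overlap / max(1, sum(hyp.values())))
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    bp = min(1.0, math.exp(1 - len(reference) / len(hypothesis)))
    return bp * math.exp(log_mean)
```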
Conclusion
  • The authors' work is motivated by two major deficiencies in training the current generative models for text generation: exposure bias and a loss which does not operate at the sequence level.
  • The authors propose the MIXER algorithm, which deals with these issues and enables successful training of reinforcement learning models for text generation.
  • The authors achieve this by replacing the initial random policy with the optimal policy of a cross-entropy trained model and by gradually exposing the model more and more to its own predictions in an incremental learning framework.
  • The authors' training algorithm relies on a single sample; it would be interesting to investigate the effect of more comprehensive search methods at training time
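The single-sample REINFORCE estimator that training relies on can be sketched for a toy softmax policy over a small vocabulary. This is our simplification: the paper works with per-timestep rewards and a learned baseline, while here the reward function and baseline are scalars we supply:

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_step(logits, reward_fn, baseline=0.0, rng=random):
    """One single-sample REINFORCE estimate: sample an action from
    the softmax policy, observe its reward, and return the gradient
    of (reward - baseline) * log p(action) w.r.t. the logits,
    i.e. (r - b) * (onehot(a) - probs)."""
    probs = softmax(logits)
    a = rng.choices(range(len(logits)), weights=probs)[0]
    r = reward_fn(a)
    grad = [(r - baseline) * ((1.0 if i == a else 0.0) - p)
            for i, p in enumerate(probs)]
    return grad, a, r
```

Subtracting a baseline leaves the estimator unbiased but reduces its variance, which is what makes single-sample training workable.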
Tables
  • Table1: Text generation models can be described across three dimensions: whether they suffer from exposure bias, whether they are trained in an end-to-end manner using back-propagation, and whether they are trained to predict one word ahead or the whole sequence
  • Table2: Best scheduling parameters found by hyper-parameter search of MIXER
Related work
  • Sequence models are typically trained to predict the next word using the cross-entropy loss. At test time, it is common to use beam search to explore multiple alternative paths (Sutskever et al., 2014; Bahdanau et al., 2015; Rush et al., 2015). While this improves generation by typically one or two BLEU points (Papineni et al., 2002), it makes generation at least k times slower, where k is the number of active paths in the beam (see Sec. 3.1.1 for more details).
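The bookkeeping behind beam search, and why its cost grows with the number of active paths k, can be sketched as follows. This is a toy of our own with context-independent per-step scores, not a real decoder:

```python
def beam_search(step_logprobs, k, length):
    """Toy beam search over a fixed per-step table of word
    log-probabilities: keep the k highest-scoring partial
    sequences after every step. Each step scores k times as
    many candidates as greedy decoding, which is why generation
    is roughly k times slower."""
    beams = [([], 0.0)]  # (sequence, cumulative log-prob)
    for t in range(length):
        cands = [(seq + [w], score + lp)
                 for seq, score in beams
                 for w, lp in enumerate(step_logprobs[t])]
        beams = sorted(cands, key=lambda c: -c[1])[:k]
    return beams[0]
```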

    The idea of improving generation by letting the model use its own predictions at training time (the key proposal of this work) was first advocated by Daumé III et al. (2009). In their seminal work, the authors observed that structured prediction problems can be cast as a particular instance of reinforcement learning. They then proposed SEARN, an algorithm to learn such structured prediction tasks. The basic idea is to let the model use its own predictions at training time to produce a sequence of actions (e.g., the choice of the next word). Then, a search algorithm is run to determine the optimal action at each time step, and a classifier (a.k.a. policy) is trained to predict that action. A similar idea was later proposed by Ross et al. (2011) in an imitation learning framework. Unfortunately, for text generation it is generally intractable to compute an oracle of the optimal target word given the words predicted so far. The oracle issue was later addressed by an algorithm called Data As Demonstrator (DAD) (Venkatraman et al., 2015), applied to text generation by Bengio et al. (2015), whereby the target action at step k is the k-th action taken by the optimal policy (the ground-truth sequence) regardless of which input is fed to the system, whether it is the ground truth or the model's prediction. While DAD usually improves generation, it seems unsatisfactory to force the model to predict a certain word regardless of the preceding words (see Sec. 3.1.2 for more details).
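DAD's rule, that the input at a step may be the model's own previous prediction while the target remains the ground-truth word at that step, can be sketched as follows. The function name, the `<bos>` token, and the boolean `use_model` mask are our own illustration:

```python
def dad_inputs_and_targets(ground_truth, model_predictions, use_model):
    """Data as Demonstrator (DAD), sketched: at step k the input
    may be the model's own previous prediction (when use_model[k]
    is True) instead of the ground-truth word, but the target is
    always the k-th ground-truth word, regardless of the input."""
    inputs, targets = [], []
    for k in range(len(ground_truth)):
        if k == 0:
            prev = "<bos>"
        elif use_model[k]:
            prev = model_predictions[k - 1]
        else:
            prev = ground_truth[k - 1]
        inputs.append(prev)
        targets.append(ground_truth[k])
    return inputs, targets
```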
References
  • Auli, M. and Gao, J. Decoder integration and expected BLEU training for recurrent neural network language models. In Proc. of ACL, June 2014.
  • Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
  • Bengio, S., Vinyals, O., Jaitly, N., and Shazeer, N. Scheduled sampling for sequence prediction with recurrent neural networks. In NIPS, 2015.
  • Cettolo, M., Niehues, J., Stüker, S., Bentivogli, L., and Federico, M. Report on the 11th IWSLT evaluation campaign. In Proc. of IWSLT, 2014.
  • Daumé III, H., Langford, J., and Marcu, D. Search-based structured prediction as classification. Machine Learning Journal, 2009.
  • Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
  • Elman, J. L. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.
  • Graff, D., Kong, J., Chen, K., and Maeda, K. English Gigaword. Technical report, 2003.
  • Graves, A. and Jaitly, N. Towards end-to-end speech recognition with recurrent neural networks. In ICML, 2014.
  • He, X. and Deng, L. Maximum expected BLEU training of phrase and lexicon translation models. In ACL, 2012.
  • Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • Kneser, R. and Ney, H. Improved backing-off for m-gram language modeling. In Proc. of ICASSP, pp. 181–184, May 1995.
  • Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., and Herbst, E. Moses: Open source toolkit for statistical machine translation. In Proc. of ACL Demo and Poster Sessions, June 2007.
  • Liang, P., Bouchard-Côté, A., Taskar, B., and Klein, D. An end-to-end discriminative approach to machine translation. In Proc. of ACL-COLING, pp. 761–768, July 2006.
  • Lin, C.-Y. and Hovy, E. H. Automatic evaluation of summaries using n-gram co-occurrence statistics. In HLT-NAACL, 2003.
  • Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: Common objects in context. Technical report, 2014.
  • McAllester, D., Hazan, T., and Keshet, J. Direct loss minimization for structured prediction. In NIPS, 2010.
  • Mikolov, T., Karafiát, M., Burget, L., Černocký, J., and Khudanpur, S. Recurrent neural network based language model. In INTERSPEECH, 2010.
  • Mnih, V., Heess, N., Graves, A., and Kavukcuoglu, K. Recurrent models of visual attention. In NIPS, 2014.
  • Morin, F. and Bengio, Y. Hierarchical probabilistic neural network language model. In AISTATS, 2005.
  • Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. BLEU: a method for automatic evaluation of machine translation. In ACL, 2002.
  • Reiter, E. and Dale, R. Building Natural Language Generation Systems. Cambridge University Press, 2000.
  • Ross, S., Gordon, G. J., and Bagnell, J. A. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, 2011.
  • Rosti, A.-V. I., Zhang, B., Matsoukas, S., and Schwartz, R. Expected BLEU training for graphs: BBN system description for WMT11 system combination task. In Proc. of WMT, pp. 159–165, July 2011.
  • Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.
  • Rush, A. M., Chopra, S., and Weston, J. A neural attention model for abstractive sentence summarization. In EMNLP, 2015.
  • Sutskever, I., Vinyals, O., and Le, Q. Sequence to sequence learning with neural networks. In NIPS, 2014.
  • Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 1988.
  • Venkatraman, A., Hebert, M., and Bagnell, J. A. Improving multi-step prediction of learned time series models. In AAAI, 2015.
  • Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
  • Xu, K., Ba, J., Kiros, R., Courville, A., Salakhutdinov, R., Zemel, R., and Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
  • Zaremba, W. and Sutskever, I. Reinforcement learning neural Turing machines. Technical report, 2015.