Improving Neural Machine Translation Models with Monolingual Data

Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), Berlin, Germany.

Keywords:
state-of-the-art performance, translation unit, IWSLT, training data, language model

Abstract:

Neural Machine Translation (NMT) has obtained state-of-the-art performance for several language pairs, while only using parallel data for training. Target-side monolingual data plays an important role in boosting fluency for phrase-based statistical machine translation, and we investigate the use of monolingual data for NMT. In contrast to previous work, which combines NMT models with separately trained language models, we note that encoder-decoder NMT architectures already have the capacity to learn the same information as a language model, and we explore strategies to train with monolingual data without changing the neural network architecture. By pairing monolingual training data with an automatic back-translation, we can treat it as additional parallel training data, and we obtain substantial improvements on the WMT 15 task English↔German (+2.8–3.7 BLEU), and for the low-resourced IWSLT 14 task Turkish→English (+2.1–3.4 BLEU), obtaining new state-of-the-art results. We also show that fine-tuning on in-domain monolingual and parallel data gives substantial improvements for the IWSLT 15 task English→German.


Introduction
  • Neural Machine Translation (NMT) has obtained state-of-the-art performance for several language pairs, while only using parallel data for training.
  • Target-side monolingual data plays an important role in boosting fluency for phrase-based statistical machine translation.
  • Language models trained on monolingual data have played a central role in statistical machine translation since the first IBM models (Brown et al., 1990).
  • The amount of available monolingual data in the target language typically far exceeds the amount of parallel data, and models typically improve when trained on more data, or on data more similar to the translation task.
Highlights
  • Neural Machine Translation (NMT) has obtained state-of-the-art performance for several language pairs, while only using parallel data for training.
  • We find that mixing parallel training data with monolingual data with a dummy source side in a ratio of 1-1 improves quality by 0.4–0.5 BLEU for the single system, and by 1 BLEU for the ensemble (see the sketch after this list).
  • As in the reverse translation direction, we see substantial improvements (3.6–3.7 BLEU) from adding monolingual training data with synthetic source sentences, which is substantially bigger than the improvement observed with deep fusion (Gülçehre et al., 2015); our ensemble outperforms the previous state of the art on newstest2015 by 2.3 BLEU.
  • In phrase-based SMT, we find that the use of back-translated training data has a moderate positive effect on the WMT test sets (+0.7 BLEU), but not on the IWSLT test sets.
  • As a proxy for sentence-level fluency, we investigate word-level fluency, specifically whether words produced as sequences of subword units by Neural Machine Translation systems trained with additional monolingual data are more natural.
  • We propose two simple methods to use monolingual training data during training of Neural Machine Translation systems, with no changes to the network architecture.
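To make the first method concrete, the following is a minimal sketch, under stated assumptions, of pairing monolingual target sentences with a dummy source token and mixing them with the parallel corpus in a 1-1 ratio; the helper and token names are illustrative, not the authors' released code.

    import random

    DUMMY_SOURCE = "<null>"  # placeholder token standing in for the missing source sentence

    def mix_parallel_and_monolingual(parallel_pairs, mono_targets, ratio=1.0, seed=0):
        # Pair monolingual target sentences with a dummy source and mix them with
        # the parallel data in roughly a 1-1 ratio (illustrative sketch only).
        rng = random.Random(seed)
        n_mono = min(int(len(parallel_pairs) * ratio), len(mono_targets))
        dummy_pairs = [(DUMMY_SOURCE, tgt) for tgt in rng.sample(mono_targets, n_mono)]
        mixed = parallel_pairs + dummy_pairs
        rng.shuffle(mixed)  # interleave so mini-batches contain both kinds of examples
        return mixed

    # Toy usage:
    parallel = [("ein Haus", "a house"), ("ein Auto", "a car")]
    mono = ["a garden", "a street", "a river"]
    print(mix_parallel_and_monolingual(parallel, mono))

For these dummy-source examples the paper additionally reports freezing the encoder and attention parameters during training, so that the uninformative source context does not degrade them; that detail is omitted from the sketch.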
Results
  • 4.2.1 English→German WMT 15

    Table 3 shows English→German results with WMT training and test data.
  • Mixing the original parallel corpus with parallelsynth gives some improvement over the baseline (1.7 BLEU on average), but the novel monolingual training data (Gigawordmono) gives higher improvements, despite being out-of-domain in relation to the test sets.
  • In phrase-based SMT, the authors find that the use of back-translated training data has a moderate positive effect on the WMT test sets (+0.7 BLEU), but not on the IWSLT test sets
  • This is in line with the expectation that the main effect of back-translated data for phrase-based SMT is domain adaptation (Bertoldi and Federico, 2009).
  • Both the WMT test sets and the News Crawl corpora which the authors used as monolingual data come from the same source, a web crawl of newspaper articles. In contrast, News Crawl is out-of-domain for the IWSLT test sets.
Conclusion
  • The authors propose two simple methods to use monolingual training data during training of NMT systems, with no changes to the network architecture.
  • Providing training examples with dummy source context was successful to some extent, but the authors achieve substantial gains in all tasks, and new state-of-the-art results, via back-translation of monolingual target data into the source language, treating this synthetic data as additional training data (as sketched after this list).
  • The authors show that small amounts of in-domain monolingual data, back-translated into the source language, can be effectively used for domain adaptation.
  • The authors identified domain adaptation effects, a reduction of overfitting, and improved fluency as reasons for the effectiveness of using monolingual data for training.
  • It is conceivable that larger synthetic data sets, or data sets obtained via data selection, will provide bigger performance benefits.
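A minimal sketch of the back-translation recipe follows; it assumes a generic translate(model, sentences) helper and a pre-trained target-to-source model, and the names are illustrative rather than the authors' code.

    def back_translate(target_to_source_model, mono_target_sentences, translate):
        # Machine-translate monolingual target-side sentences back into the source
        # language, and pair each synthetic source with its original, human-written target.
        synthetic_sources = translate(target_to_source_model, mono_target_sentences)
        return list(zip(synthetic_sources, mono_target_sentences))

    def build_training_corpus(parallel_pairs, synthetic_pairs):
        # Treat the synthetic pairs as additional parallel training data; the NMT
        # architecture and training objective stay unchanged.
        return parallel_pairs + synthetic_pairs

The key design point is that only the source side is synthetic: the target side, which the decoder learns to produce, remains human-written text.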
Tables
  • Table1: English↔German training data
  • Table2: Turkish→English training data
  • Table3: English→German translation performance (BLEU) on WMT training/test sets. Ens-4: ensemble of 4 models. Number of training instances varies due to differences in training time and speed
  • Table4: English→German translation performance (BLEU) on IWSLT test sets (TED talks). Single models
  • Table5: German→English translation performance (BLEU) on WMT training/test sets (newstest2014; newstest2015)
  • Table6: Turkish→English translation performance (tokenized BLEU) on IWSLT test sets (TED talks). Single models. Number of training instances varies due to early stopping
  • Table7: English→German translation performance (BLEU) on WMT training/test sets (newstest2014; newstest2015). Systems differ in how the synthetic training data is obtained. Ensembles of 4 models (unless specified otherwise)
  • Table8: Phrase-based SMT results
  • Table9: Number of words in system output that do not occur in parallel training data (count_ref = 1168), and the proportion that is attested in data, or natural according to a native speaker. English→German; newstest2015; ensemble systems
Related work
  • To our knowledge, the integration of monolingual data for pure neural machine translation architectures was first investigated by Gülçehre et al. (2015), who train monolingual language models independently and then integrate them during decoding, either through rescoring of the beam (shallow fusion), or by adding the recurrent hidden state of the language model to the decoder state of the encoder-decoder network, with an additional controller mechanism that controls the magnitude of the LM signal (deep fusion); a shallow-fusion reranking sketch appears at the end of this section. In deep fusion, the controller parameters and output parameters are tuned on further parallel training data, but the language model parameters are fixed during the finetuning stage. Jean et al. (2015b) also report on experiments with reranking of NMT output with a 5-gram language model, but improvements are small (between 0.1–0.5 BLEU).

    The production of synthetic parallel texts bears resemblance to data augmentation techniques used in computer vision, where datasets are often augmented with rotated, scaled, or otherwise distorted variants of the (limited) training set (Rowley et al., 1996).

    Another similar avenue of research is self-training (McClosky et al., 2006; Schwenk, 2008). The main difference is that self-training typically refers to a scenario where the training set is enhanced with training instances carrying artificially produced output labels, whereas we start with human-produced output (i.e. the translation), and artificially produce an input. We expect that this is more robust towards noise in the automatic translation. Improving NMT with monolingual source data, following similar work on phrase-based SMT (Schwenk, 2008), remains possible future work.
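For contrast with the back-translation approach, shallow fusion as described above can be sketched as a simple reranking of beam hypotheses with an interpolated score; the scoring helpers and the weight beta below are assumptions for illustration, not the original implementation.

    def shallow_fusion_rerank(hypotheses, nmt_log_prob, lm_log_prob, beta=0.2):
        # Rescore beam-search hypotheses with a weighted combination of the NMT score
        # and the score of an independently trained language model (shallow fusion);
        # beta controls the strength of the LM signal. Hypothetical interfaces.
        scored = [(nmt_log_prob(h) + beta * lm_log_prob(h), h) for h in hypotheses]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [h for _, h in scored]

Deep fusion differs in that the language model's hidden state is wired into the decoder and gated by a learned controller, which requires changes to the network; back-translation, by contrast, leaves decoding and architecture untouched.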
Funding
  • This project received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement 645452 (QT21).
Study subjects and analysis
sentence pairs: 320000
We use data provided for the IWSLT 14 machine translation track (Cettolo et al., 2014), namely the WIT3 parallel corpus (Cettolo et al., 2012), which consists of TED talks, and the SETimes corpus (Tyers and Alperen, 2010). After removal of sentence pairs which contain empty lines or lines with a length ratio above 9, we retain 320,000 sentence pairs of training data. For the experiments with monolingual training data, we use the English LDC Gigaword corpus (Fifth Edition).
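The cleaning step described here (dropping pairs with an empty side or a length ratio above 9) can be reproduced with a short filter like the one below; whitespace tokenization is an assumption, since the exact preprocessing script is not given here.

    def filter_sentence_pairs(pairs, max_length_ratio=9.0):
        # Drop sentence pairs with an empty side or a length ratio above the threshold
        # (lengths measured in whitespace-separated tokens); illustrative sketch.
        kept = []
        for src, tgt in pairs:
            src_len, tgt_len = len(src.split()), len(tgt.split())
            if src_len == 0 or tgt_len == 0:
                continue  # pair contains an empty line
            if max(src_len, tgt_len) / min(src_len, tgt_len) <= max_length_ratio:
                kept.append((src, tgt))
        return kept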

data sets: WMT 15 (German→English)
4.2.3 German→English WMT 15. Results for German→English on the WMT 15 data sets are shown in Table 5. As in the reverse translation direction, we see substantial improvements (3.6–3.7 BLEU) from adding monolingual training data with synthetic source sentences, which is substantially bigger than the improvement observed with deep fusion (Gülçehre et al., 2015); our ensemble outperforms the previous state of the art on newstest2015 by 2.3 BLEU.

monolingual data sets: 3
For comparability, we measure training set cross-entropy for all models on the same random sample of the parallel training set. We can see that the model trained on only parallel training data quickly overfits, while all three monolingual data sets (parallelsynth, Gigawordmono, or Gigawordsynth) delay overfitting, and give better perplexity on the development set. The best development set cross-entropy is reached by Gigawordsynth.
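The measurement behind this comparison can be sketched as follows, assuming a model interface that returns per-token log-probabilities for a sentence (a hypothetical interface, not a specific toolkit API).

    def corpus_cross_entropy(sentences, token_log_probs):
        # Average negative log-probability per target token over a fixed sample;
        # evaluating every model on the same sample makes the curves comparable.
        total_logprob, total_tokens = 0.0, 0
        for sentence in sentences:
            logps = token_log_probs(sentence)  # one log-probability per token (assumed)
            total_logprob += sum(logps)
            total_tokens += len(logps)
        return -total_logprob / max(total_tokens, 1)  # lower means a better fit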

annotated words: 100
We also count how many of them are attested in the full monolingual corpus or the reference translation, which we all consider ‘natural’. Additionally, the main author, a native speaker of German, annotated a random subset (n = 100) of unattested words of each system according to their naturalness, distinguishing between natural German words (or names) such as Literatur|klassen ‘literature classes’, and nonsensical ones such as *As|best|atten (a misspelling of Asbestmatten ‘asbestos mats’). In the results (Table 9), we see that the systems trained with additional monolingual or synthetic data have a higher proportion of novel words attested in the non-parallel data, and a higher proportion that is deemed natural by our annotator.
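The counting part of this analysis can be approximated with a small script; the manual naturalness judgement is of course not automatable, and the names below are illustrative rather than the authors' script.

    def novel_word_analysis(output_words, parallel_vocab, mono_vocab, reference_vocab):
        # Collect output words never seen in the parallel training data, then measure
        # how many of them are attested in the monolingual data or the reference.
        novel = [w for w in set(output_words) if w not in parallel_vocab]
        attested = [w for w in novel if w in mono_vocab or w in reference_vocab]
        share_attested = len(attested) / len(novel) if novel else 0.0
        return len(novel), share_attested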

References
  • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the International Conference on Learning Representations (ICLR).
  • Nicola Bertoldi and Marcello Federico. 2009. Domain adaptation for statistical machine translation with monolingual resources. In Proceedings of the Fourth Workshop on Statistical Machine Translation, StatMT ’09. Association for Computational Linguistics.
  • Ondrej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Carolina Scarton, Lucia Specia, and Marco Turchi. 2015. Findings of the 2015 Workshop on Statistical Machine Translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 1–46, Lisbon, Portugal. Association for Computational Linguistics.
  • P. F. Brown, S. A. Della Pietra, V. J. Della Pietra, F. Jelinek, J. D. Lafferty, R. L. Mercer, and P. S. Roossin. 1990. A Statistical Approach to Machine Translation. Computational Linguistics, 16(2):79–85.
  • Mauro Cettolo, Christian Girardi, and Marcello Federico. 2012. WIT3: Web Inventory of Transcribed and Translated Talks. In Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT), pages 261–268, Trento, Italy.
  • Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, and Marcello Federico. 2014. Report on the 11th IWSLT Evaluation Campaign, IWSLT 2014. In Proceedings of the 11th Workshop on Spoken Language Translation, pages 2–16, Lake Tahoe, CA, USA.
  • Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar. Association for Computational Linguistics.
  • Alex Graves. 2011. Practical Variational Inference for Neural Networks. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2348–2356. Curran Associates, Inc.
  • Çaglar Gülçehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loïc Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On Using Monolingual Corpora in Neural Machine Translation. CoRR, abs/1503.03535.
  • Barry Haddow, Matthias Huck, Alexandra Birch, Nikolay Bogoychev, and Philipp Koehn. 2015. The Edinburgh/JHU Phrase-based Machine Translation Systems for WMT 2015. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 126–133, Lisbon, Portugal. Association for Computational Linguistics.
  • Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580.
  • Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015a. On Using Very Large Target Vocabulary for Neural Machine Translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1–10, Beijing, China. Association for Computational Linguistics.
  • Sébastien Jean, Orhan Firat, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015b. Montreal Neural Machine Translation Systems for WMT’15. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 134–140, Lisbon, Portugal. Association for Computational Linguistics.
  • Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the ACL-2007 Demo and Poster Sessions, pages 177–180, Prague, Czech Republic. Association for Computational Linguistics.
  • Patrik Lambert, Holger Schwenk, Christophe Servan, and Sadaf Abdul-Rauf. 2011. Investigations on Translation Model Adaptation Using Monolingual Data. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 284–293, Edinburgh, Scotland. Association for Computational Linguistics.
  • Minh-Thang Luong and Christopher D. Manning. 2015. Stanford Neural Machine Translation Systems for Spoken Language Domains. In Proceedings of the International Workshop on Spoken Language Translation 2015, Da Nang, Vietnam.
  • Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics.
  • David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective Self-training for Parsing. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, HLT-NAACL ’06, pages 152–159, New York. Association for Computational Linguistics.
  • Henry Rowley, Shumeet Baluja, and Takeo Kanade. 1996. Neural Network-Based Face Detection. In Computer Vision and Pattern Recognition ’96.
  • Hasim Sak, Tunga Güngör, and Murat Saraçlar. 2007. Morphological Disambiguation of Turkish Text with Perceptron Algorithm. In CICLing 2007, pages 107–118.
  • Holger Schwenk. 2008. Investigations on Large-Scale Lightly-Supervised Training for Statistical Machine Translation. In International Workshop on Spoken Language Translation, pages 182–189.
  • Rico Sennrich and Barry Haddow. 2015. A Joint Dependency Model of Morphological and Syntactic Structure for Statistical Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2081–2087, Lisbon, Portugal. Association for Computational Linguistics.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), Berlin, Germany.
  • Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, pages 3104–3112, Montreal, Quebec, Canada.
  • Alex Ter-Sarkisov, Holger Schwenk, Fethi Bougares, and Loïc Barrault. 2015. Incremental Adaptation Strategies for Neural Network Language Models. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, pages 48–56, Beijing, China. Association for Computational Linguistics.
  • Francis M. Tyers and Murat S. Alperen. 2010. SETimes: A parallel corpus of Balkan languages. In Workshop on Exploitation of multilingual resources and tools for Central and (South) Eastern European Languages at the Language Resources and Evaluation Conference, pages 1–5.