Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models

In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2016.

Keywords:
character level, rare word, open vocabulary, long short-term memory, NIPS

Abstract:

Nearly all previous work on neural machine translation (NMT) has used quite restricted vocabularies, perhaps with a subsequent method to patch in unknown words. This paper presents a novel word-character solution to achieving open vocabulary NMT. We build hybrid systems that translate mostly at the word level and consult the character components for rare words.

Introduction
  • Neural Machine Translation (NMT) is a simple new architecture for getting machines to translate.
  • Neural machine translation aims to directly model the conditional probability p(y|x) of translating a source sentence, x1, …, xn, to a target sentence, y1, …, ym.
  • An encoder first computes a representation s of the source sentence; based on that source representation, the decoder generates a translation, one target word at a time, and decomposes the log conditional probability as log p(y|x) = Σ_{j=1..m} log p(y_j | y_{<j}, s) (a toy numeric sketch of this decomposition follows this list).
  • All the models utilize the deep multi-layer architecture with LSTM as the recurrent unit; detailed formulations are in (Zaremba et al., 2014).
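
The decomposition above means the sentence-level score is just the sum of per-position word log-probabilities. Below is a minimal numeric sketch, with toy softmax distributions standing in for a trained decoder (not the paper's model):

```python
import numpy as np

def sentence_log_prob(step_distributions, target_ids):
    """log p(y|x) = sum_j log p(y_j | y_<j, s).

    step_distributions: one softmax vector over the target vocabulary per
        target position, each conditioned on the source representation s
        and the previously generated words y_<j.
    target_ids: indices of the reference target words y_1 .. y_m.
    """
    return sum(np.log(dist[y]) for dist, y in zip(step_distributions, target_ids))

# Toy example: 4-word target vocabulary, 2-word target sentence.
dists = [np.array([0.1, 0.7, 0.1, 0.1]),   # p(. | s)
         np.array([0.2, 0.1, 0.6, 0.1])]   # p(. | y_1, s)
print(sentence_log_prob(dists, [1, 2]))    # log 0.7 + log 0.6
```
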
Highlights
  • Neural Machine Translation (NMT) is a simple new architecture for getting machines to translate
  • Neural Machine Translation is a single deep neural network that is trained end-to-end with several advantages such as simplicity and generalization
  • We propose a novel hybrid architecture for Neural Machine Translation that translates mostly at the word level and consults the character components for rare words when necessary (a control-flow sketch of this fallback appears after this list)
  • We demonstrate at scale that on the WMT’15 English to Czech translation task, such a hybrid approach provides an additional boost of +2.1–11.4 BLEU points over models that already handle unknown words
  • Word-level models are fast to train and offer high-quality translation; whereas, character-level models help achieve the goal of open vocabulary Neural Machine Translation
  • Our analysis has shown that our model has the ability to not only generate well-formed words for Czech, a highly inflected language with an enormous and complex vocabulary, but also build accurate representations for English source words
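
The hybrid mechanism in the highlights amounts to two character-level fallbacks: rare source words get representations composed from their characters, and every <unk> emitted by the word-level decoder is re-generated character by character. The following is a minimal control-flow sketch; WORD_VOCAB, char_compose, embed_source_word, and char_generate are hypothetical toy stand-ins for the trained components, not the paper's implementation.

```python
WORD_VOCAB = {"the", "cat", "sat", "<unk>", "</s>"}  # toy word vocabulary

def char_compose(word):
    # Stand-in for the character-level LSTM that builds a representation
    # for a word outside the word vocabulary.
    return ("char-composed", tuple(word))

def embed_source_word(word):
    """Frequent words use their word embedding; rare words fall back to a
    character-composed representation."""
    return ("word-embedding", word) if word in WORD_VOCAB else char_compose(word)

def char_generate(decoder_state):
    # Stand-in for the character-level decoder that spells out a rare target
    # word, seeded from the word-level decoder's hidden state (a string here).
    return "".join(decoder_state)

def recover_unknowns(word_level_output, decoder_states):
    """Replace each <unk> produced by the word-level decoder with a word
    generated character by character."""
    return [char_generate(s) if tok == "<unk>" else tok
            for tok, s in zip(word_level_output, decoder_states)]

# Source side: "acquaintance" is rare, so its representation is composed from characters.
print([embed_source_word(w) for w in ["the", "acquaintance", "sat"]])
# Target side: the second position was <unk>; the character decoder fills it in.
print(recover_unknowns(["the", "<unk>", "sat"], [None, "dog", None]))
```
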
Methods
  • The authors evaluate the effectiveness of the models on the publicly available WMT’15 translation task from English into Czech with newstest2013 (3000 sentences) as a development set and newstest2015 (2656 sentences) as a test set.
  • Two metrics are used: case-sensitive NIST BLEU (Papineni et al., 2002) and chrF3 (Popović, 2015).
  • The latter measures the amount of overlapping character n-grams and has been argued to be a better metric for translation tasks out of English (a toy implementation follows this list).
  • Czech possesses an enormously large vocabulary and is a challenging language to translate into.
  • This language pair has a large amount of training data, so the authors can evaluate at scale.
  • Though the techniques are language independent, it is easier for them to work with Czech since Czech uses the Latin alphabet with some diacritics
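
Since chrF is only named in the bullets above, here is a minimal sketch of the metric under common assumptions (character n-grams up to order 6, whitespace removed, uniform averaging over orders); the reference implementation of Popović (2015) differs in details. With beta = 3, recall is weighted three times as heavily as precision, which gives chrF3.

```python
from collections import Counter

def char_ngrams(text, n):
    s = text.replace(" ", "")  # assumption: compare characters only, no spaces
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=3.0):
    """Character n-gram F-score; beta=3 weights recall 3x over precision (chrF3)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p, r = sum(precisions) / len(precisions), sum(recalls) / len(recalls)
    return 0.0 if p == r == 0 else (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

# Toy Czech hypothesis vs. reference.
print(chrf("kočka seděla na rohožce", "kočka sedí na rohožce"))
```
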
Results
  • The authors compare the models with several strong systems. These include the winning entry in WMT’15, which was trained on a much larger amount of data, 52.6M parallel and 393.0M monolingual sentences (Bojar and Tamchyna, 2015). In contrast, the authors merely use the provided parallel corpus of 15.8M sentences.
  • As shown in Table 2, for a purely word-based approach, the single NMT model outperforms the best single model in (Jean et al., 2015b) by +1.8 points despite using a smaller vocabulary of only 50K words versus 200K words.
  • The authors' ensemble system (e) slightly outperforms the best previous NMT system with 18.4 BLEU
Conclusion
  • The authors have proposed a novel hybrid architecture that combines the strength of both word- and character-based models.
  • Word-level models are fast to train and offer high-quality translation; whereas, character-level models help achieve the goal of open vocabulary NMT.
  • The authors have demonstrated these two aspects through the experimental results and translation examples.
  • The authors' best hybrid model has surpassed the performance of both the best word-based NMT system and the best non-neural model to establish a new state-of-the-art result for English-Czech translation in WMT’15 with 20.7 BLEU.
  • The authors' analysis has shown that the model has the ability to not only generate well-formed words for Czech, a highly inflected language with an enormous and complex vocabulary, but also build accurate representations for English source words
Tables
  • Table 1: WMT’15 English-Czech data – shown are various statistics of our training data such as sentence, token (word and character counts), as well as type (sizes of the word and character vocabularies). We additionally show the amount of words in a vocabulary expressible by a list of 200 characters found in frequent words
  • Table 2: WMT’15 English-Czech results – shown are the vocabulary sizes, perplexities, BLEU, and chrF3 scores of various systems on newstest2015. Perplexities are listed under two categories, word (w) and character (c). Best and important results per metric are highlighted
  • Table 3: Word similarity task – shown are Spearman's correlation ρ on the Rare Word dataset of various models (with different vocab sizes |V|); a toy version of this evaluation is sketched after these captions
  • Table 4: Sample translations on newstest2015 – for each example, we show the source, human translation, and translations of the following NMT systems: word model (d), char model (g), and hybrid model (k). We show the translations before replacing <unk> tokens (if any) for the word-based and hybrid models. The following formats are used to highlight correct, wrong, and close translation segments
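
Table 3's word-similarity evaluation reports Spearman's ρ between model similarities and human judgments. A minimal sketch of that computation, assuming a hypothetical model_vectors lookup (for rare words these vectors would be composed by the character-level network) and using scipy for the rank correlation:

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def word_similarity_rho(model_vectors, scored_pairs):
    """Spearman's rho between model cosine similarities and human scores.

    model_vectors: dict word -> vector.
    scored_pairs: list of (word1, word2, human_similarity) tuples.
    """
    model_scores, gold_scores = [], []
    for w1, w2, gold in scored_pairs:
        if w1 in model_vectors and w2 in model_vectors:
            model_scores.append(cosine(model_vectors[w1], model_vectors[w2]))
            gold_scores.append(gold)
    rho, _ = spearmanr(model_scores, gold_scores)
    return rho

# Toy usage with random vectors and made-up judgments (illustration only).
rng = np.random.default_rng(0)
vecs = {w: rng.normal(size=16)
        for w in ["unconcerned", "carefree", "table", "chair", "car", "banana"]}
pairs = [("unconcerned", "carefree", 8.0), ("table", "chair", 6.5), ("car", "banana", 1.0)]
print(word_similarity_rho(vecs, pairs))
```
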
Related work
  • There has been a recent line of work on end-to-end character-based neural models which achieve good results for part-of-speech tagging (dos Santos and Zadrozny, 2014; Ling et al., 2015a), dependency parsing (Ballesteros et al., 2015), text classification (Zhang et al., 2015), speech recognition (Chan et al., 2016; Bahdanau et al., 2016), and language modeling (Kim et al., 2016; Jozefowicz et al., 2016). However, success has not been shown for cross-lingual tasks such as machine translation. Sennrich et al. (2016) propose to segment words into smaller units and translate just like at the word level, which does not learn to understand relationships among words (a toy sketch of such subword merges follows this paragraph).
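
The subword segmentation of Sennrich et al. (2016) is based on byte-pair-encoding merges: start from characters and repeatedly merge the most frequent adjacent symbol pair. The following is a toy sketch of that merge-learning loop, not their released implementation.

```python
from collections import Counter

def learn_bpe_merges(word_freqs, num_merges):
    """Learn subword merge operations from a dict of word -> frequency."""
    vocab = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}  # start from characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)          # most frequent adjacent pair
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():       # apply the merge everywhere
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

# Toy usage: frequent character sequences such as "es"/"est" become single units.
print(learn_bpe_merges({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 4))
```
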

    Our work takes inspiration from (Luong et al., 2013) and (Li et al., 2015). Similar to the former, we build representations for rare words on the fly from subword units. However, we utilize recurrent neural networks with characters as the basic units; whereas Luong et al. (2013) use recursive neural networks with morphemes as units, which requires the existence of a morphological analyzer. In comparison with (Li et al., 2015), our hybrid architecture is also a hierarchical sequence-to-sequence model, but operates at a different granularity level, word-character. In contrast, Li et al. (2015) build hierarchical models at the sentence-word level for paragraphs and documents.
Funding
  • This work was partially supported by NSF Award IIS-1514268 and by a gift from Bloomberg L.P.
Reference
  • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.
  • Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio. 2016. End-to-end attention-based large vocabulary speech recognition. In ICASSP.
  • Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2015. Improved transition-based parsing by modeling characters instead of words with LSTMs. In EMNLP.
  • Ondřej Bojar and Aleš Tamchyna. 2015. CUNI in WMT15: Chimera strikes again. In WMT.
  • William Chan, Navdeep Jaitly, Quoc V. Le, and Oriol Vinyals. 2016. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In ICASSP.
  • Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP.
  • Cícero Nogueira dos Santos and Bianca Zadrozny. 2014. Learning character-level representations for part-of-speech tagging. In ICML.
  • Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
  • Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015a. On using very large target vocabulary for neural machine translation. In ACL.
  • Sébastien Jean, Orhan Firat, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015b. Montreal neural machine translation systems for WMT’15. In WMT.
  • Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. 2016. Exploring the limits of language modeling. arXiv preprint.
  • Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In EMNLP.
  • Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2016. Character-aware neural language models. In AAAI.
  • Jiwei Li, Minh-Thang Luong, and Dan Jurafsky. 2015. A hierarchical neural autoencoder for paragraphs and documents. In ACL.
  • Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by agreement. In NAACL.
  • Wang Ling, Chris Dyer, Alan W. Black, Isabel Trancoso, Ramon Fermandez, Silvio Amir, Luís Marujo, and Tiago Luís. 2015a. Finding function in form: Compositional character models for open vocabulary word representation. In EMNLP.
  • Wang Ling, Isabel Trancoso, Chris Dyer, and Alan Black. 2015b. Character-based neural machine translation. arXiv preprint.
  • Minh-Thang Luong and Christopher D. Manning. 2015. Stanford neural machine translation systems for spoken language domain. In IWSLT.
  • Minh-Thang Luong, Richard Socher, and Christopher D. Manning. 2013. Better word representations with recursive neural networks for morphology. In CoNLL.
  • Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015a. Effective approaches to attention-based neural machine translation. In EMNLP.
  • Minh-Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. 2015b. Addressing the rare word problem in neural machine translation. In ACL.
  • Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL.
  • Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.
  • Vu Pham, Théodore Bluche, Christopher Kermorvant, and Jérôme Louradour. 2014. Dropout improves recurrent neural networks for handwriting recognition. In ICFHR.
  • Maja Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. In WMT.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In ACL.
  • Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS.
  • Laurens van der Maaten. 2013. Barnes-Hut-SNE. In ICLR.
  • Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329.
  • Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In NIPS.