Learning bilingual word embeddings with (almost) no bilingual data

ACL 2017, pp. 451–462.

Keywords:
multilingual word embeddings, bilingual evidence, computational linguistics, bilingual data, little bilingual evidence

Abstract:

Most methods to learn bilingual word embeddings rely on large parallel corpora, which is difficult to obtain for most language pairs. This has motivated an active research line to relax this requirement, with methods that use document-aligned corpora or bilingual dictionaries of a few thousand words instead. In this work, we further reduce the need of bilingual resources using a very simple self-learning approach that can be combined with any dictionary-based mapping technique. Our method exploits the structural similarity of embedding spaces, and works with as little bilingual evidence as a 25-word dictionary or even an automatically generated list of numerals, obtaining results comparable to those of systems that use richer resources.

Introduction
  • Most methods to learn multilingual word embeddings make use of large parallel corpora (Gouws et al., 2015; Luong et al., 2015), but there have been several proposals to relax this requirement, given the scarcity of parallel data for most language pairs.
  • A possible relaxation is to use document-aligned or label-aligned comparable corpora (Søgaard et al., 2015; Vulić and Moens, 2016; Mogadala and Rettinger, 2016), but large amounts of such corpora are not available for many language pairs either.
  • Bilingual dictionaries of a few thousand entries, as required by mapping-based methods, are not readily available for many language pairs, especially those involving less-resourced languages
Highlights
  • Multilingual word embeddings have attracted a lot of attention in recent times
  • An alternative approach that we follow here is to independently train the embeddings for each language on monolingual corpora, and learn a linear transformation that maps the embeddings from one space into the other by minimizing the distances between the entries of a bilingual dictionary, usually in the range of a few thousand entries (Mikolov et al., 2013a; Artetxe et al., 2016)
  • We propose a simple self-learning framework to learn bilingual word embedding mappings in combination with any embedding mapping and dictionary induction technique (a sketch of the resulting loop follows this list)
  • Our experiments on bilingual lexicon induction and cross-lingual word similarity show that our method is able to learn high-quality bilingual embeddings from as little bilingual evidence as a 25-word dictionary or an automatically generated list of numerals, obtaining results that are competitive with state-of-the-art systems using much richer bilingual resources like larger dictionaries or parallel corpora
  • In spite of its simplicity, a more detailed analysis shows that our method is implicitly optimizing a meaningful objective function that is independent from any bilingual data and which, with a better optimization method, might make it possible to learn bilingual word embeddings in a completely unsupervised manner
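
A minimal sketch of such a self-learning loop, assuming row-normalized embedding matrices and an orthogonal (Procrustes) mapping step in the style of Artetxe et al. (2016); the function and variable names are illustrative, not taken from the authors' released code:

```python
import numpy as np

def self_learning(X, Z, seed_pairs, n_iter=10):
    # X: source embeddings (n_src x d), Z: target embeddings (n_tgt x d),
    # both length-normalized row-wise. seed_pairs: (src_idx, tgt_idx) tuples
    # from a tiny seed dictionary (e.g. 25 word pairs, or shared numerals).
    pairs = list(seed_pairs)
    for _ in range(n_iter):
        # Mapping step: solve the orthogonal Procrustes problem on the
        # current dictionary, W = UV^T where U S V^T = SVD(X_D^T Z_D).
        src, tgt = map(list, zip(*pairs))
        u, _, vt = np.linalg.svd(X[src].T @ Z[tgt])
        W = u @ vt
        # Induction step: re-induce the dictionary by nearest-neighbor
        # retrieval over the full vocabulary (dot product = cosine here).
        sims = (X @ W) @ Z.T
        pairs = list(enumerate(sims.argmax(axis=1)))
    return W, pairs
```

The paper iterates until a convergence criterion is met rather than for a fixed number of steps, but the structure is the same: the dictionary induced in one iteration becomes the training dictionary of the next.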
Methods
  • [Figure: accuracy on bilingual lexicon induction as a function of the seed dictionary size, comparing our method with those of Artetxe et al. (2016), Xing et al. (2015), Zhang et al. (2016) and Mikolov et al. (2013a).]
  • It might seem somewhat surprising at first that, as seen in the previous section, the simple self-learning approach is able to learn high-quality bilingual embeddings from small seed dictionaries instead of falling into degenerate solutions.
  • The authors argue that, for the embedding mapping and dictionary induction methods described in Section 3, the proposed self-learning framework is implicitly solving the following global optimization problem: $W^* = \arg\max_{W} \sum_i \max_j \left( X_{i*} W \right) \cdot Z_{j*}$ subject to $W^\top W = I$, where $X$ and $Z$ are the length-normalized source and target embedding matrices.
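
This implicit objective could be evaluated with a few lines of NumPy, reusing the X, Z and W of the sketch in the Highlights section; implicit_objective is a hypothetical helper for illustration, not part of the paper's code:

```python
def implicit_objective(X, Z, W):
    # sum_i max_j (X_i W) . Z_j: for every source word, the best similarity
    # reached anywhere in the target space. With unit-length rows and an
    # orthogonal W, every entry of sims is a cosine similarity.
    sims = (X @ W) @ Z.T
    return sims.max(axis=1).sum()
```

Note that this function never touches bilingual data: it only measures how close each mapped source word gets to some target word.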
Conclusion
  • Conclusions and future work

    In this work, the authors propose a simple self-learning framework to learn bilingual word embedding mappings in combination with any embedding mapping and dictionary induction technique.
  • The authors' experiments on bilingual lexicon induction and cross-lingual word similarity show that the method is able to learn high-quality bilingual embeddings from as little bilingual evidence as a 25-word dictionary or an automatically generated list of numerals, obtaining results that are competitive with state-of-the-art systems using much richer bilingual resources like larger dictionaries or parallel corpora.
  • Since the analysis shows that the method implicitly optimizes an objective that is independent from any bilingual data, the authors would like to delve deeper into this direction and fine-tune the method so it can reliably learn high-quality bilingual word embeddings without any bilingual evidence at all.
  • The authors would also like to apply the model in the decipherment scenario (Dou et al., 2015).
Tables
  • Table 1: Accuracy (%) on bilingual lexicon induction with the 5,000-entry, 25-entry and numerals seed dictionaries for all three language pairs.
  • Table 2: Spearman correlations on English-Italian and English-German cross-lingual word similarity.

    A degenerate mapping would place source language words in arbitrary regions of the target embedding space, so it would be unlikely that they have any target language word nearby, making the optimization value small. In contrast, a good solution would map source language words close to their translation equivalents in the target language space, and they would thus have their corresponding embeddings nearby, making the optimization value large. While it is certainly possible to build degenerate solutions that take high optimization values for small subsets of the vocabulary, we think that the structural similarity between independently trained embedding spaces in different languages is strong enough that optimizing this function yields meaningful bilingual mappings when the size of the vocabulary is much larger than the dimensionality of the embeddings.
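
To make the contrast concrete, a hypothetical check reusing the implicit_objective sketch above: a random orthogonal map should leave most mapped source words far from every target word and score poorly, while a learned mapping should score much higher:

```python
rng = np.random.default_rng(0)
d = X.shape[1]
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal map
print(implicit_objective(X, Z, Q))  # expected: low
print(implicit_objective(X, Z, W))  # expected: high
```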
Related work
  • We will first focus on bilingual embedding mappings, which are the basis of our proposals, and then on other unsupervised and weakly supervised methods to learn bilingual word embeddings.

Funding
  • We thank the anonymous reviewers for their insightful comments and Flavio Merenda for his help with the error analysis. This research was partially supported by a Google Faculty Award, the Spanish MINECO (TUNER TIN2015-65308-C5-1-R, MUSTER PCIN-2015-226 and TADEEP TIN2015-70214-P, co-funded by EU FEDER), the Basque Government (MODELA KK-2016/00082) and the UPV/EHU (excellence research group).
  • Mikel Artetxe enjoys a doctoral grant from the Spanish MECD.
Reference
  • Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2016. Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, pages 2289–2294. https://aclweb.org/anthology/D16-1250.
  • Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language resources and evaluation 43(3):209–226.
  • Jose Camacho-Collados, Mohammad Taher Pilehvar, and Roberto Navigli. 2015. A framework for the construction of monolingual and cross-lingual word similarity datasets. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). Association for Computational Linguistics, Beijing, China, pages 1–7. http://www.aclweb.org/anthology/P15-2001.
  • Hailong Cao, Tiejun Zhao, Shu Zhang, and Yao Meng. 2016. A distribution-based model to learn bilingual word embeddings. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. The COLING 2016 Organizing Committee, Osaka, Japan, pages 1818–1827. http://aclweb.org/anthology/C16-1171.
  • Sarath Chandar A P, Stanislas Lauly, Hugo Larochelle, Mitesh Khapra, Balaraman Ravindran, Vikas C Raykar, and Amrita Saha. 2014. An autoencoder approach to learning bilingual word representations. In Advances in Neural Information Processing Systems 27, Curran Associates, Inc., pages 1853–1861. http://papers.nips.cc/paper/5270-an-autoencoder-approach-to-learning-bilingual-word-representations.pdf.
  • Georgiana Dinu, Angeliki Lazaridou, and Marco Baroni. 2015. Improving zero-shot learning by mitigating the hubness problem. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), workshop track.
  • Qing Dou, Ashish Vaswani, Kevin Knight, and Chris Dyer. 2015. Unifying Bayesian inference and vector space models for improved decipherment. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Beijing, China, pages 836–845. http://www.aclweb.org/anthology/P15-1081.
  • Antonio Valerio Miceli Barone. 2016. Towards cross-lingual distributed representations without parallel text trained with adversarial autoencoders. In Proceedings of the 1st Workshop on Representation Learning for NLP. Association for Computational Linguistics, Berlin, Germany, pages 121–126. http://anthology.aclweb.org/W16-1614.
  • Tomas Mikolov, Quoc V Le, and Ilya Sutskever. 2013a. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.
  • Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, Curran Associates, Inc., pages 3111–3119. http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf.
  • Aditya Mogadala and Achim Rettinger. 2016. Bilingual word embeddings from parallel and non-parallel corpora for cross-language text classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, San Diego, California, pages 692–702. http://www.aclweb.org/anthology/N16-1083.
  • Yves Peirsman and Sebastian Padó. 2010. Cross-lingual induction of selectional preferences with bilingual vector spaces. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, Los Angeles, California, pages 921–929. http://www.aclweb.org/anthology/N10-1135.
  • Samuel L. Smith, David H.P. Turban, Steven Hamblin, and Nils Y. Hammerla. 2017. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. In Proceedings of the 5th International Conference on Learning Representations (ICLR 2017), conference track.
  • Chen-Tse Tsai and Dan Roth. 2016. Cross-lingual wikification using multilingual embeddings. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, San Diego, California, pages 589–598. http://www.aclweb.org/anthology/N16-1072.
  • Shyam Upadhyay, Manaal Faruqui, Chris Dyer, and Dan Roth. 2016. Cross-lingual models of word embeddings: An empirical comparison. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, pages 1661–1670. http://www.aclweb.org/anthology/P16-1157.
  • Ivan Vulić and Anna Korhonen. 2016. On the role of seed lexicons in learning bilingual word embeddings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, pages 247–257. http://www.aclweb.org/anthology/P16-1024.
  • Ivan Vulić and Marie-Francine Moens. 2013. A study on bootstrapping bilingual vector spaces from non-parallel data (and nothing else). In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Seattle, Washington, USA, pages 1613–1624. http://www.aclweb.org/anthology/D13-1168.
  • Ivan Vulić and Marie-Francine Moens. 2016. Bilingual distributed word representations from document-aligned comparable data. Journal of Artificial Intelligence Research 55(1):953–994.
  • Min Xiao and Yuhong Guo. 2014. Distributed word representation learning for cross-lingual dependency parsing. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning. Association for Computational Linguistics, Ann Arbor, Michigan, pages 119–129. http://www.aclweb.org/anthology/W14-1613.
  • Anders Søgaard, Željko Agić, Héctor Martínez Alonso, Barbara Plank, Bernd Bohnet, and Anders Johannsen. 2015. Inverted indexing for cross-lingual NLP. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Beijing, China, pages 1713–1722. http://www.aclweb.org/anthology/P15-1165.
  • Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12). European Language Resources Association (ELRA), Istanbul, Turkey.
  • Chao Xing, Dong Wang, Chao Liu, and Yiye Lin. 2015. Normalized word embedding and orthogonal transform for bilingual word translation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Denver, Colorado, pages 1006–1011. http://www.aclweb.org/anthology/N15-1104.
  • Yuan Zhang, David Gaddy, Regina Barzilay, and Tommi Jaakkola. 2016. Ten pairs to tag – multilingual POS tagging via coarse mapping between embeddings. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, San Diego, California, pages 1307–1317. http://www.aclweb.org/anthology/N16-1156.
  • Kai Zhao, Hany Hassan, and Michael Auli. 2015. Learning translation models from monolingual continuous representations. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Denver, Colorado, pages 1527–1536. http://www.aclweb.org/anthology/N15-1176.
  • Will Y. Zou, Richard Socher, Daniel Cer, and Christopher D. Manning. 2013. Bilingual word embeddings for phrase-based machine translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Seattle, Washington, USA, pages 1393–1398. http://www.aclweb.org/anthology/D13-1141.