Dependency-Based Word Embeddings

ACL, pp. 302-308, 2014.

Cited by: 820
Keywords:
arbitrary context, skip-gram model, neural network language, distributional hypothesis, good argument

Abstract:

While continuous word embeddings are gaining popularity, current models are based solely on linear contexts. In this work, we generalize the skip-gram model with negative sampling introduced by Mikolov et al. to include arbitrary contexts. In particular, we perform experiments with dependency-based contexts, and show that they produce markedly different kinds of similarities.
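The generalization can be summarized with the standard skip-gram negative-sampling (SGNS) objective over a collection D of observed (word, context) pairs. The following is a sketch in conventional SGNS notation, not a formula quoted from the paper: σ is the sigmoid function, k the number of negative samples, and P_D the (smoothed) empirical distribution over contexts.

\[
\arg\max \; \sum_{(w,c)\in D} \Big( \log \sigma(\vec{w}\cdot\vec{c}) \;+\; k \cdot \mathbb{E}_{c_N \sim P_D}\big[\log \sigma(-\vec{w}\cdot\vec{c}_N)\big] \Big)
\]

The only change required for arbitrary contexts is in how D is constructed: c may be a neighboring word from a linear window or, as in this work, a syntactic context such as a labeled dependency relation.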

Introduction
Highlights
  • Word representation is central to natural language processing
  • Based on the distributional hypothesis, many methods of deriving word representations were explored in the NLP community
  • In Section 5 we show that the SKIPGRAM model does allow for some introspection by querying it for contexts that are “activated by” a target word
  • Syntactic contexts capture different information than bag-of-words contexts, as we demonstrate using the sentence “Australian scientist discovers star with telescope” (see the extraction sketch after this list)
  • Our software, allowing for experimentation with arbitrary contexts, together with the embeddings described in this paper, is available for download at the authors’ websites
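To make the contrast concrete, below is a minimal illustrative sketch (not the authors' released code) of how the two kinds of word–context pairs could be extracted from the example sentence. The hand-written parse triples, the collapsed-preposition label prep_with, and the helper names bow_contexts and deps_contexts are assumptions for illustration; the pairing scheme follows the paper, where a modifier m attached to a head h with relation lbl yields the contexts (h, m/lbl) and (m, h/lbl⁻¹).

```python
# Illustrative sketch: contrast linear bag-of-words contexts with
# dependency-based contexts for one sentence. The parse is hand-specified.

sentence = ["australian", "scientist", "discovers", "star", "with", "telescope"]

# (modifier, head, relation) triples for the example sentence, with the
# preposition collapsed into a single prep_with relation.
parse = [
    ("australian", "scientist", "amod"),
    ("scientist", "discovers", "nsubj"),
    ("star", "discovers", "dobj"),
    ("telescope", "discovers", "prep_with"),
]

def bow_contexts(tokens, window=2):
    """Linear contexts: every token within +/- window positions."""
    pairs = []
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs += [(word, tokens[j]) for j in range(lo, hi) if j != i]
    return pairs

def deps_contexts(triples):
    """Dependency contexts: (head, modifier/label) and (modifier, head/label-inverse)."""
    pairs = []
    for mod, head, label in triples:
        pairs.append((head, f"{mod}/{label}"))
        pairs.append((mod, f"{head}/{label}-1"))  # "-1" marks the inverse relation
    return pairs

# BOW2 pairs "discovers" with "with" but not with "telescope";
# DEPS pairs "discovers" with "telescope/prep_with" directly.
print(bow_contexts(sentence, window=2))
print(deps_contexts(parse))
```

This illustrates the paper's point: dependency contexts can reach informative words that a small linear window misses, while excluding coincidental neighbors.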
Methods
  • Experiments and Evaluation

    The authors experiment with 3 training conditions: BOW5, BOW2 and DEPS.
  • The authors modified word2vec to support arbitrary contexts, and to output the context embeddings in addition to the word embeddings (see the query sketch after this list).
  • For bag-of-words contexts the authors used the original word2vec implementation, and for syntactic contexts, the authors used the modified version.
  • All tokens were converted to lowercase, and words and contexts that appeared fewer than 100 times were filtered out.
  • This resulted in a vocabulary of about 175,000 words, with over 900,000 distinct syntactic contexts.
  • The authors report results for 300-dimensional embeddings, though similar trends were observed with 600 dimensions
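Since the modified word2vec also writes out the context embeddings, the introspection mentioned in the highlights (contexts “activated by” a target word) reduces to scoring every context vector against the word vector. The sketch below is an assumption about how such a query could look; the file names vecs.txt and ctx_vecs.txt and the helper load_vectors are hypothetical, and a word2vec-style text format (one vector per line) is assumed.

```python
# Illustrative sketch: rank contexts by how strongly they are "activated by"
# a target word, using separately saved word and context embedding files.
import numpy as np

def load_vectors(path):
    """Load word2vec-style text vectors: each line is 'token v1 v2 ... vd'."""
    vecs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) <= 2:   # skip the 'vocab_size dim' header line
                continue
            vecs[parts[0]] = np.array([float(x) for x in parts[1:]])
    return vecs

def top_contexts(word, word_vecs, ctx_vecs, k=10):
    """Return the k contexts with the highest dot product against the word vector."""
    w = word_vecs[word]
    scored = [(ctx, float(np.dot(w, v))) for ctx, v in ctx_vecs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]

# Hypothetical usage with the two output files of the modified word2vec:
# word_vecs = load_vectors("vecs.txt")
# ctx_vecs = load_vectors("ctx_vecs.txt")
# print(top_contexts("scientist", word_vecs, ctx_vecs))
```

Ranking by cosine similarity instead of the raw dot product would work equally well for this kind of qualitative inspection.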
Conclusion
  • The authors presented a generalization of the SKIPGRAM embedding model in which the linear bag-of-words contexts are replaced with arbitrary ones, and experimented with dependency-based contexts, showing that they produce markedly different kinds of similarities.
  • These results are expected, and follow similar findings in the distributional semantics literature.
  • The authors' software, allowing for experimentation with arbitrary contexts, together with the embeddings described in this paper, is available for download at the authors’ websites
Summary
  • Introduction:

    Word representation is central to natural language processing. The default approach of representing words as discrete and distinct symbols is insufficient for many tasks, and suffers from poor generalization.
  • It has been proposed to represent words as dense vectors that are derived by various training methods inspired by neural-network language modeling (Bengio et al., 2003; Collobert and Weston, 2008; Mnih and Hinton, 2008; Mikolov et al., 2011; Mikolov et al., 2013b)
  • These representations, referred to as “neural embeddings” or “word embeddings”, have been shown to perform well across a variety of tasks (Turian et al., 2010; Collobert et al., 2011; Socher et al., 2011; Al-Rfou et al., 2013)
Tables
  • Table 1: Target words and their 5 most similar words, as induced by different embeddings
  • Table 2: Words and their top syntactic contexts
Reference
  • Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pasca, and Aitor Soroa. 2009. A study on similarity and relatedness using distributional and wordnet-based approaches. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 19–27, Boulder, Colorado, June. Association for Computational Linguistics.
  • Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2013. Polyglot: Distributed word representations for multilingual NLP. In Proc. of CoNLL 2013.
  • Marco Baroni and Alessandro Lenci. 2010. Distributional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4):673–721.
  • Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155.
  • Peter F Brown, Robert L Mercer, Vincent J Della Pietra, and Jenifer C Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4).
  • John A Bullinaria and Joseph P Levy. 2007. Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39(3):510–526.
  • Christine Chiarello, Curt Burgess, Lorie Richards, and Alma Pollock. 1990. Semantic and associative priming in the cerebral hemispheres: Some words do, some words don’t... sometimes, some places. Brain and Language, 38(1):75–104.
  • Raphael Cohen, Yoav Goldberg, and Michael Elhadad. 2012. Domain adaptation of a dependency parser with a class-class selectional preference model. In Proceedings of ACL 2012 Student Research Workshop, pages 43–48, Jeju Island, Korea, July. Association for Computational Linguistics.
  • Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167.
  • Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537.
  • Marie-Catherine de Marneffe and Christopher D. Manning. 2008. The Stanford typed dependencies representation. In Coling 2008: Proceedings of the workshop on Cross-Framework and Cross-Domain Parser Evaluation, pages 1–8, Manchester, UK, August. Coling 2008 Organizing Committee.
  • Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2002. Placing search in context: The concept revisited. ACM Transactions on Information Systems, 20(1):116–131.
  • Yoav Goldberg and Omer Levy. 2014. word2vec explained: deriving Mikolov et al.’s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722.
  • Yoav Goldberg and Joakim Nivre. 2012. A dynamic oracle for the arc-eager system. In Proc. of COLING 2012.
  • Yoav Goldberg and Joakim Nivre. 2013. Training deterministic parsers with non-deterministic oracles. Transactions of the Association for Computational Linguistics, 1.
  • Zellig Harris. 1954. Distributional structure. Word, 10(23):146–162.
  • Omer Levy and Yoav Goldberg. 2014. Linguistic regularities in sparse and explicit word representations. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, Baltimore, Maryland, USA, June. Association for Computational Linguistics.
  • Dekang Lin. 1998. Automatic retrieval and clustering of similar words. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 2, ACL ’98, pages 768–774, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • Tomas Mikolov, Stefan Kombrink, Lukas Burget, JH Cernocky, and Sanjeev Khudanpur. 2011. Extensions of recurrent neural network language model. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 5528–5531. IEEE.
  • Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.
  • Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 3111–3119.
  • Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013c. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia, June. Association for Computational Linguistics.
  • Andriy Mnih and Geoffrey E Hinton. 2008. A scalable hierarchical distributed language model. In Advances in Neural Information Processing Systems, pages 1081–1088.
  • Sebastian Padó and Mirella Lapata. 2007. Dependency-based construction of semantic space models. Computational Linguistics, 33(2):161–199.
  • Alan Ritter, Mausam, and Oren Etzioni. 2010. A latent dirichlet allocation method for selectional preferences. In ACL, pages 424–434.
  • Diarmuid O Seaghdha. 2010. Latent variable models of selectional preference. In ACL, pages 435–444.
  • Richard Socher, Jeffrey Pennington, Eric H Huang, Andrew Y Ng, and Christopher D Manning. 2011. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 151–161. Association for Computational Linguistics.
  • Kristina Toutanova, Dan Klein, Chris Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of NAACL.
  • Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394. Association for Computational Linguistics.
  • P.D. Turney and P. Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1):141–188.
  • Peter D. Turney. 2012. Domain and function: A dual-space model of semantic relations and compositions. Journal of Artificial Intelligence Research, 44:533–585.
  • Jakob Uszkoreit and Thorsten Brants. 2008. Distributed word clustering for large scale class-based language modeling in machine translation. In Proc. of ACL, pages 755–762.