Chinese Word Embedding Learning with Limited Data

WEB AND BIG DATA, APWEB-WAIM 2021, PT I(2021)

Abstract
With the increasing demand for high-quality Chinese word embeddings in natural language processing, Chinese word embedding learning has attracted wide attention in recent years. Most existing research focuses on capturing word semantics from large-scale datasets. However, these methods struggle to produce effective word embeddings when only limited data is available for specific domains. Observing the rich semantic information in Chinese fine-grained structures, we develop a model that fully fuses these structures as auxiliary information for word embedding learning. The proposed model views word context information as a combination of word, character, pronunciation, and component. In addition, it adds the semantic relationship between pronunciations and components as a constraint, so that the auxiliary information is exploited comprehensively. Based on the decomposition of a shifted positive pointwise mutual information matrix, our model can effectively generate Chinese word embeddings from small-scale data. The results of word analogy, word similarity, and named entity recognition experiments conducted on two public datasets show the effectiveness of our proposed model for capturing Chinese word semantics with limited data.
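The abstract builds on factorizing a shifted positive PMI (SPPMI) matrix. As background, a minimal SPPMI-plus-SVD sketch is shown below (in the style of Levy and Goldberg's matrix-factorization view of word embeddings); this is a generic illustration only, not the paper's model, which additionally fuses character, pronunciation, and component information into the context and adds a pronunciation–component constraint. The co-occurrence matrix here is toy data.

```python
import numpy as np

def sppmi_embeddings(cooc, k=1, dim=2):
    """Generic SPPMI + SVD embedding sketch (not the paper's full model).

    cooc: (V, V) word-context co-occurrence counts.
    k:    negative-sampling shift; SPPMI = max(PMI - log k, 0).
    dim:  embedding dimensionality.
    """
    total = cooc.sum()
    word = cooc.sum(axis=1, keepdims=True)   # row marginals
    ctx = cooc.sum(axis=0, keepdims=True)    # column marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(cooc * total / (word * ctx))
    # Clip -inf (zero counts) to 0, shift by log k, keep positive part.
    sppmi = np.maximum(np.nan_to_num(pmi, neginf=0.0) - np.log(k), 0.0)
    u, s, _ = np.linalg.svd(sppmi)
    # Symmetric weighting: scale singular vectors by sqrt of singular values.
    return u[:, :dim] * np.sqrt(s[:dim])

# Toy 3-word co-occurrence matrix (hypothetical counts).
cooc = np.array([[0., 5., 1.],
                 [5., 0., 4.],
                 [1., 4., 0.]])
emb = sppmi_embeddings(cooc, k=1, dim=2)
print(emb.shape)
```

The symmetric `sqrt(s)` weighting is one common choice when splitting the factorization into word and context embeddings; the paper's method differs in how the matrix itself is constructed from fine-grained Chinese structures.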
Keywords
Chinese word embedding, Matrix factorization, Chinese fine-grained information