Cluster Based Chinese Abbreviation Modeling

15TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2014), VOLS 1-4 (2014)

Abstract

Abbreviations are widely observed in spoken Chinese. Automatic generation of Chinese abbreviations helps to improve Chinese natural language understanding systems and Chinese search engines. Abbreviation generation is treated as a character-based tagging problem. Because training data is limited, Chinese abbreviation generation suffers from data sparseness. Two types of strategies are proposed to reduce the impact of data sparseness. First, in addition to a traditional sequence labelling method, Conditional Random Fields (CRF), we propose to apply a Recurrent Neural Network with Maximum Entropy Extension (RNNME) [9], which shows performance similar to that of CRF in our experiment. Second, we propose to use training data clustering and latent topic modeling in abbreviation generation. Training data clustering or topic modeling not only addresses data sparseness, but also takes advantage of the fact that full names from the same cluster or the same latent topic have similar abbreviation patterns. Our experimental results show that manual clustering yields a relative improvement of 8% in abbreviation generation accuracy, and that latent topics obtained from Latent Dirichlet Allocation (LDA) yield a relative improvement of 10%.
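
To make the character-based tagging formulation concrete, the following minimal sketch converts a (full name, abbreviation) pair into one label per character and back. The two-label scheme (`K` for kept, `S` for skipped), the greedy left-to-right alignment, and the function names `tag_characters` and `apply_tags` are illustrative assumptions; the abstract does not specify the label set or alignment used in the paper. A sequence labeller such as a CRF would be trained to predict these labels from the characters of the full name.

```python
def tag_characters(full_name, abbreviation):
    """Label each character of full_name as 'K' (kept) or 'S' (skipped).

    Assumption: the abbreviation is a subsequence of the full name, and a
    greedy left-to-right alignment is acceptable for building training labels.
    """
    labels = []
    j = 0  # position of the next abbreviation character to match
    for ch in full_name:
        if j < len(abbreviation) and ch == abbreviation[j]:
            labels.append("K")
            j += 1
        else:
            labels.append("S")
    if j != len(abbreviation):
        raise ValueError("abbreviation is not a subsequence of the full name")
    return labels


def apply_tags(full_name, labels):
    """Rebuild an abbreviation from per-character labels."""
    return "".join(ch for ch, label in zip(full_name, labels) if label == "K")


if __name__ == "__main__":
    labels = tag_characters("北京大学", "北大")
    print(labels)
    print(apply_tags("北京大学", labels))
```

Under the clustering strategy described above, one plausible arrangement (again an assumption) is to train a separate tagger on the label sequences of each cluster or latent topic, so that full names with similar abbreviation patterns share a model.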

Keywords

Chinese name abbreviation, Conditional Random Field, Recurrent Neural Network