NERA 2.0: Improving coverage and performance of rule-based named entity recognition for Arabic.

Natural Language Engineering (2017)

Abstract
Named Entity Recognition (NER) is an essential task for many natural language processing systems and makes use of various linguistic resources. NER becomes more complicated when the language in use is morphologically rich and structurally complex, such as Arabic, which has a set of characteristics that make it particularly challenging to handle. In previous work, we proposed an Arabic NER system that follows the hybrid approach, i.e. it integrates both rule-based and machine learning-based NER approaches. Our hybrid NER system is the state of the art in Arabic NER according to its performance on standard evaluation datasets. In this article, we discuss a novel methodology for overcoming the coverage drawback of rule-based NER systems in order to improve their performance and allow for automated rule updates. The presented mechanism uses the recognition decisions made by the hybrid NER system to identify the weaknesses of the rule-based component and to derive new linguistic rules that enhance the rule base, which helps in achieving more reliable and accurate results. We used the ACE 2004 Newswire (NW) standard dataset as a resource for extracting and analyzing new linguistic rules for person, location and organization name recognition. We formulate each new rule based on two distinctive feature groups, i.e. gazetteers for each type of named entity and Part-of-Speech (POS) tags, in particular noun and proper noun. Fourteen new patterns are derived, formulated as grammar rules, and evaluated in terms of coverage. The conducted experiments exploit a POS-tagged version of the ACE 2004 NW dataset. The empirical results show that the enhanced rule-based system, i.e. NERA 2.0, improves the coverage of the previously misclassified person, location and organization named entity types by 69.93 per cent, 57.09 per cent and 54.28 per cent, respectively.
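
To make the idea of a rule built from a gazetteer plus POS tags more concrete, the following is a minimal sketch and not one of the fourteen NERA 2.0 rules. Everything in it is an assumption made for illustration: the trigger-word gazetteer `PERSON_TRIGGERS`, the `NNP` proper-noun tag, the English placeholder tokens (used instead of Arabic for readability), and the `coverage` helper, which computes coverage as the share of previously missed entities that a new rule recovers. The abstract does not spell out the exact metric, so that definition is a guess.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (surface form, POS tag)

# Hypothetical gazetteer of person-introducing trigger words (illustrative only).
PERSON_TRIGGERS = {"president", "dr", "mr"}


def match_person_rule(tokens: List[Token]) -> List[Tuple[int, int]]:
    """Return (start, end) spans of proper-noun runs that follow a trigger word.

    Toy rule: TRIGGER (from gazetteer) + one or more tokens tagged NNP -> PERSON.
    The end index is exclusive.
    """
    spans = []
    i = 0
    while i < len(tokens):
        word, _ = tokens[i]
        if word.lower() in PERSON_TRIGGERS:
            j = i + 1
            # Extend the span over consecutive proper nouns.
            while j < len(tokens) and tokens[j][1] == "NNP":
                j += 1
            if j > i + 1:
                spans.append((i + 1, j))
                i = j
                continue
        i += 1
    return spans


def coverage(missed_gold: List[Tuple[int, int]],
             recovered: List[Tuple[int, int]]) -> float:
    """Percentage of previously missed gold spans that the new rule recovers (assumed definition)."""
    missed = set(missed_gold)
    if not missed:
        return 0.0
    return 100.0 * len(missed & set(recovered)) / len(missed)


# Hypothetical POS-tagged sentence.
sentence = [("President", "NN"), ("Hosni", "NNP"), ("Mubarak", "NNP"),
            ("visited", "VBD"), ("Cairo", "NNP")]
spans = match_person_rule(sentence)
```

In this sketch the rule fires only when a gazetteer entry and the POS pattern agree, which mirrors the two feature groups named in the abstract. A real rule base would work on Arabic tokens and a richer tag set.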