Mitigating Vocabulary Mismatch On Multi-Domain Corpus Using Word Embeddings And Thesaurus

Nagesh Yadav, Alessandro Dibari, Miao Wei, John Segrave-Daly, Conor Cullen, Denisa Moga, Jillian Scalvini, Ciaran Hennessy, Morten Kristiansen, Omar O’Sullivan

ICAART: Proceedings of the 12th International Conference on Agents and Artificial Intelligence, Vol. 1 (2020)

Abstract
Query expansion is an extensively researched topic in information retrieval that helps to bridge the vocabulary mismatch problem, i.e., the gap between the way users express concepts and the way those concepts appear in the corpus. In this paper, we propose a query-expansion technique for searching a corpus that contains a mix of terminology from several domains, some of which have well-curated thesauri and some of which do not. An iterative fusion technique is proposed that exploits thesauri for those domains that have them, and word embeddings for those that do not. For our experiments, we used a corpus of Medicaid healthcare policies that contain a mix of terminology from the medical and insurance domains. The Unified Medical Language System (UMLS) thesaurus was used to expand medical concepts, and a word-embedding model was used to expand non-medical concepts. The technique was evaluated against an Elasticsearch baseline with no query expansion. The results show an 8% improvement in recall and a 12% improvement in mean average precision.
Keywords
Information Retrieval, Query Expansion, Word Embedding, Thesaurus
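
The routing idea in the abstract (thesaurus expansion where a thesaurus covers the term, word-embedding neighbours otherwise) can be illustrated with a minimal sketch. This is not the authors' iterative fusion algorithm, whose details the abstract does not give. The toy thesaurus, the toy vectors, the similarity threshold, and the function names `expand_query` and `embedding_neighbours` are all hypothetical stand-ins for UMLS and a trained embedding model.

```python
import math

# Hypothetical toy thesaurus standing in for UMLS (entries invented for illustration).
TOY_THESAURUS = {
    "hypertension": ["high blood pressure"],
    "myocardial infarction": ["heart attack"],
}

# Hypothetical toy vectors standing in for a trained word-embedding model.
TOY_EMBEDDINGS = {
    "copay": (0.90, 0.10, 0.00),
    "copayment": (0.88, 0.12, 0.02),
    "deductible": (0.70, 0.30, 0.10),
    "eligibility": (0.00, 0.20, 0.90),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def embedding_neighbours(term, embeddings, k=1, min_sim=0.95):
    """Return up to k nearest terms whose similarity is at least min_sim."""
    if term not in embeddings:
        return []
    scored = [
        (cosine(embeddings[term], vec), other)
        for other, vec in embeddings.items()
        if other != term
    ]
    scored.sort(reverse=True)
    return [other for sim, other in scored[:k] if sim >= min_sim]


def expand_query(terms, thesaurus, embeddings, k=1):
    """Expand each term via the thesaurus if covered, else via embeddings."""
    expanded = []
    for term in terms:
        expanded.append(term)
        if term in thesaurus:
            # Domain with a curated thesaurus: use its synonyms.
            expanded.extend(thesaurus[term])
        else:
            # No thesaurus coverage: fall back to embedding neighbours.
            expanded.extend(embedding_neighbours(term, embeddings, k))
    # Remove duplicates while preserving first-seen order.
    return list(dict.fromkeys(expanded))


if __name__ == "__main__":
    print(expand_query(["hypertension", "copay"], TOY_THESAURUS, TOY_EMBEDDINGS))
```

In this sketch a medical term such as "hypertension" is expanded from the thesaurus, while an insurance term such as "copay" falls through to the embedding lookup. Terms covered by neither source are passed through unchanged.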