Empirical studies on the impact of lexical resources on CLIR performance

Information Processing & Management (2005)

In this paper, we compile and review several experiments measuring cross-lingual information retrieval (CLIR) performance as a function of the following resources: bilingual term lists, parallel corpora, machine translation (MT), and stemmers. Our CLIR system uses a simple probabilistic language model; the studies used TREC test corpora in Chinese, Spanish, and Arabic. Our findings include:
• Acceptable CLIR performance can be achieved using only a bilingual term list (70-80% on Chinese and Arabic corpora).
• However, if both a bilingual term list and parallel corpora are available, CLIR performance can rival monolingual performance.
• If no parallel corpus is available, pseudo-parallel texts produced by an MT system can partially compensate for the lack of parallel text.
• While stemming is normally useful, it hurt performance in our empirical studies with Arabic, a highly inflected language, when a very large Arabic-English parallel corpus was available.
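The abstract does not give the form of the probabilistic language model. As a rough illustration only, the sketch below uses one common query-likelihood formulation for CLIR, in which each query term is generated either from a background model or by translating a document term. The names `clir_log_score`, `trans_prob`, `background`, and `alpha`, as well as all toy probabilities, are assumptions made for this example and are not taken from the paper.

```python
import math
from collections import Counter


def clir_log_score(query_terms, doc_terms, trans_prob, background, alpha=0.3):
    """Log-likelihood of a query given a document in another language.

    Assumed model (not confirmed by the source):
        P(e | d) = alpha * P(e | background)
                   + (1 - alpha) * sum_f P(f | d) * P(e | f)
    where e is a query term and f ranges over the document's terms.
    """
    counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    log_score = 0.0
    for e in query_terms:
        # Probability of producing e by translating a term drawn from the document.
        p_doc = 0.0
        if doc_len:
            p_doc = sum(
                (c / doc_len) * trans_prob.get(f, {}).get(e, 0.0)
                for f, c in counts.items()
            )
        # Smooth with a background model so unseen terms do not zero the score.
        p = alpha * background.get(e, 1e-9) + (1 - alpha) * p_doc
        log_score += math.log(p)
    return log_score


# Hypothetical toy data: P(e | f) as it might be derived from a bilingual term list.
trans_prob = {
    "gato": {"cat": 1.0},
    "perro": {"dog": 1.0},
    "casa": {"house": 0.7, "home": 0.3},
}
background = {"cat": 0.01, "dog": 0.01, "house": 0.01, "home": 0.01}
doc_a = ["gato", "casa"]
doc_b = ["perro", "casa"]

score_a = clir_log_score(["cat"], doc_a, trans_prob, background)
score_b = clir_log_score(["cat"], doc_b, trans_prob, background)
```

With this toy data, `doc_a` scores higher than `doc_b` for the query "cat" because only `doc_a` contains a term that translates to it. In such a setup, a term list without probabilities could be handled by assigning uniform P(e | f) over each term's listed translations, while a parallel corpus could supply estimated values; both choices are illustrative assumptions here.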
stemming, bilingual term list, CLIR performance, lexical resource, machine translation, Arabic corpus, monolingual performance, empirical study, acceptable CLIR performance, bilingual lexicons, parallel text, CLIR system, parallel texts, cross-lingual retrieval, MT system, parallel corpus, large parallel corpus, language model