Efficiency-Effectiveness Tradeoff of Probabilistic Structured Queries for Cross-Language Information Retrieval
arXiv (2024)
Abstract
Probabilistic Structured Queries (PSQ) is a cross-language information
retrieval (CLIR) method that uses translation probabilities statistically
derived from aligned corpora. PSQ is a strong baseline for efficient CLIR using
sparse indexing. It is, therefore, useful as the first stage in a cascaded
neural CLIR system whose second stage is more effective but too inefficient to
be used on its own to search a large text collection. In this reproducibility
study, we revisit PSQ by introducing an efficient Python implementation.
Unconstrained use of all translation probabilities that can be estimated from
aligned parallel text would in the limit assign a weight to every vocabulary
term, precluding use of an inverted index to serve queries efficiently. Thus,
PSQ's effectiveness and efficiency both depend on how translation probabilities
are pruned. This paper presents experiments over a range of modern CLIR test
collections to demonstrate that achieving Pareto-optimal PSQ
effectiveness-efficiency tradeoffs benefits from multi-criteria pruning, which
has not been fully explored in prior work. Our Python PSQ implementation is
available on GitHub (https://github.com/hltcoe/PSQ), and unpruned translation
tables are available on Hugging Face
Models (https://huggingface.co/hltcoe/psq_translation_tables).
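
To make the idea of multi-criteria pruning concrete, the following is a minimal
sketch, not the code from the hltcoe/PSQ repository. It assumes a translation
table stored as a plain dictionary per source term, and it combines three
commonly used criteria: a minimum probability, a cumulative probability cap,
and a top-k limit. The function name, the parameter names (min_prob, max_cdf,
top_k), the example terms, and the final renormalization step are all
illustrative assumptions rather than details taken from the paper.

```python
from typing import Dict, Optional


def prune_translations(
    probs: Dict[str, float],
    min_prob: float = 0.0,
    max_cdf: float = 1.0,
    top_k: Optional[int] = None,
) -> Dict[str, float]:
    """Prune one source term's translation distribution (hypothetical helper).

    Candidates are visited from most to least probable and kept until any
    one of the three criteria stops the scan:
      * top_k:    at most this many candidates are kept;
      * min_prob: candidates below this probability are dropped;
      * max_cdf:  scanning stops once the kept mass reaches this value.
    The survivors are renormalized to sum to 1 (an assumption of this sketch).
    """
    # Sort by descending probability; break ties by term for determinism.
    ranked = sorted(probs.items(), key=lambda kv: (-kv[1], kv[0]))

    kept: Dict[str, float] = {}
    cumulative = 0.0
    for rank, (term, p) in enumerate(ranked):
        if top_k is not None and rank >= top_k:
            break
        if p < min_prob:
            break
        if cumulative >= max_cdf:
            break
        kept[term] = p
        cumulative += p

    total = sum(kept.values())
    if total <= 0.0:
        return {}
    return {term: p / total for term, p in kept.items()}


# Toy distribution for a single query term (made-up numbers).
toy = {"haus": 0.6, "heim": 0.25, "gebaeude": 0.1, "huette": 0.05}
pruned = prune_translations(toy, min_prob=0.08, max_cdf=0.8, top_k=3)
```

In this toy case the cumulative cap is the criterion that ends the scan, so
only the two most probable candidates survive. Tighter settings shorten the
posting lists that must be traversed at query time, at the risk of discarding
useful translations, which is the tradeoff the abstract describes.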