Ranked Document Retrieval in External Memory.

Rahul Shah, Cheng Sheng, Sharma V. Thankachan, Jeffrey Vitter

ACM Trans. Algorithms (2023)

Abstract

The ranked (or top-k) document retrieval problem is defined as follows: preprocess a collection {T_1, T_2, ..., T_d} of d strings (called documents) of total length n into a data structure, such that for any given query (P, k), where P is a string (called the pattern) of length p ≥ 1 and k ∈ [1, d] is an integer, the identifiers of the k documents that are most relevant to P can be reported, ideally in sorted order of their relevance. The seminal work by Hon et al. [FOCS 2009 and Journal of the ACM 2014] presented an O(n)-space (in words) data structure with O(p + k log k) query time. The query time was later improved to O(p + k) [SODA 2012] and further to O(p/log_σ n + k) [SIAM Journal on Computing 2017] by Navarro and Nekrich, where σ is the alphabet size. We revisit this problem in the external memory model and present three data structures. The first takes O(n) space and answers queries in O(p/B + log_B n + k/B + log*(n/B)) I/Os, where B is the block size. The second takes O(n log*(n/B)) space and answers queries in an optimal O(p/B + log_B n + k/B) I/Os. In both cases, the answers are reported unsorted with respect to relevance. To handle sorted top-k document retrieval, we present an O(n log(d/B))-space data structure with optimal query cost.
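To make the query semantics concrete, the following is a minimal brute-force sketch in Python, not the data structures of the paper. It scans every document at query time, so it has none of the stated space or I/O guarantees. The abstract does not fix a relevance measure; this sketch assumes term frequency (the number of occurrences of P in a document), which is one common choice. The names `count_occurrences` and `top_k_documents`, the tie-breaking rule, and the decision to skip documents with zero occurrences are illustrative assumptions.

```python
from typing import List, Tuple


def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, including overlapping ones."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one position so overlapping matches are counted too.
        start = text.find(pattern, start + 1)
    return count


def top_k_documents(documents: List[str], pattern: str, k: int) -> List[Tuple[int, int]]:
    """Return up to k (document_id, frequency) pairs, most relevant first.

    Assumptions (not from the paper): relevance is term frequency,
    identifiers are 1-based, ties are broken by smaller identifier,
    and documents that do not contain the pattern are never reported.
    """
    if len(pattern) < 1:
        raise ValueError("pattern must have length p >= 1")
    if not 1 <= k <= len(documents):
        raise ValueError("k must lie in [1, d]")

    scored = []
    for doc_id, doc in enumerate(documents, start=1):
        freq = count_occurrences(doc, pattern)
        if freq > 0:
            scored.append((doc_id, freq))

    # Sort by decreasing frequency, then by increasing identifier.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


if __name__ == "__main__":
    docs = ["abracadabra", "cadabra", "banana", "xyz"]
    print(top_k_documents(docs, "abra", 2))
```

This baseline reads all n characters for every query. The point of the indexes described in the abstract is to avoid that scan, so that the query cost depends only on p, k, and (in external memory) the block size B, apart from the logarithmic terms.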

Keywords

Data structures, text indexing, external memory
