Efficient Fully-Compressed Sequence Representations

Algorithmica (2012)

Abstract
We present a data structure that stores a sequence s[1..n] over alphabet [1..σ] in nℋ_0(s) + o(n)(ℋ_0(s)+1) bits, where ℋ_0(s) is the zero-order entropy of s. This structure supports the queries access, rank and select, which are fundamental building blocks for many other compressed data structures, in worst-case time 𝒪(lg lg σ) and average time 𝒪(lg ℋ_0(s)). The worst-case complexity matches the best previous results, yet these had been achieved with data structures using nℋ_0(s) + o(n lg σ) bits. On highly compressible sequences the o(n lg σ) bits of redundancy may be significant compared to the nℋ_0(s) bits that encode the data. Our representation, instead, compresses the redundancy as well. Moreover, our average-case complexity is unprecedented. Our technique is based on partitioning the alphabet into groups of characters of similar frequency. The subsequence corresponding to each group can then be encoded using fast uncompressed representations without harming the overall compression ratios, even in the redundancy. The result also improves upon the best current compressed representations of several other data structures. For example, we achieve (i) compressed redundancy, retaining the best time complexities, for the smallest existing full-text self-indexes; (ii) compressed permutations π with times for π() and π^{-1}() improved to loglogarithmic; and (iii) the first compressed representation of dynamic collections of disjoint sets. We also point out various applications to inverted indexes, suffix arrays, binary relations, and data compressors. Our structure is practical on large alphabets. Our experiments show that, as predicted by theory, it dominates the space/time tradeoff map of all the sequence representations, both in synthetic and application scenarios.
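The alphabet-partitioning idea described in the abstract can be illustrated with a small Python sketch. This is a toy under stated assumptions, not the structure from the paper: the grouping rule (a symbol whose frequency rank is r goes to class floor(log2 r)) is one plausible way to group characters of similar frequency, plain lists and linear scans stand in for the compact representations, and the names `zero_order_entropy` and `PartitionedSequence` are hypothetical. It shows only how access, rank and select decompose into a query on the class sequence plus a query on one per-class subsequence; it makes no attempt to meet the stated space or time bounds.

```python
import math
from collections import Counter


def zero_order_entropy(s):
    """Zero-order empirical entropy H_0(s), in bits per symbol."""
    n = len(s)
    return sum((c / n) * math.log2(n / c) for c in Counter(s).values())


class PartitionedSequence:
    """Toy alphabet partitioning (assumed rule: class = floor(log2(rank)))."""

    def __init__(self, s):
        counts = Counter(s)
        # Rank symbols by decreasing frequency; ties broken by symbol value.
        ordered = sorted(counts, key=lambda a: (-counts[a], a))
        self.code = {}     # symbol -> (class, offset inside class)
        self.members = {}  # class  -> symbols, indexed by offset
        for r, a in enumerate(ordered, start=1):
            k = r.bit_length() - 1  # floor(log2(r))
            self.code[a] = (k, r - (1 << k))
            self.members.setdefault(k, []).append(a)
        # t[i] is the class of s[i]; subs[k] lists the offsets of the
        # class-k symbols in their order of appearance in s.
        self.t = [self.code[a][0] for a in s]
        self.subs = {}
        for a in s:
            k, off = self.code[a]
            self.subs.setdefault(k, []).append(off)

    def access(self, i):
        """Return s[i] (0-based)."""
        k = self.t[i]
        j = self.t[:i].count(k)  # naive rank of class k in t
        return self.members[k][self.subs[k][j]]

    def rank(self, a, i):
        """Number of occurrences of a in s[0:i]."""
        if a not in self.code:
            return 0
        k, off = self.code[a]
        j = self.t[:i].count(k)
        return self.subs[k][:j].count(off)

    def select(self, a, j):
        """0-based position in s of the j-th (1-based) occurrence of a."""
        k, off = self.code[a]
        seen = 0
        for p, x in enumerate(self.subs[k]):
            if x == off:
                seen += 1
                if seen == j:
                    break
        else:
            raise ValueError("fewer than j occurrences")
        # Map position p inside subs[k] back to the (p+1)-th k in t.
        seen = 0
        for i, c in enumerate(self.t):
            if c == k:
                seen += 1
                if seen == p + 1:
                    return i
```

For example, `PartitionedSequence("abracadabra")` places `a` alone in class 0, `b` and `r` in class 1, and `c` and `d` in class 2, so the most frequent symbol sits in the smallest class. In a real implementation the linear scans would be replaced by rank/select-capable structures on `t` and on each subsequence.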
Keywords
Compressed sequence representations, Rank and select on sequences, Compact data structures, Entropy-bounded structures, Compressed text indexing