Image and Video Tokenization with Binary Spherical Quantization
CoRR (2024)
Abstract
We propose a new transformer-based image and video tokenizer with Binary
Spherical Quantization (BSQ). BSQ projects the high-dimensional visual
embedding to a lower-dimensional hypersphere and then applies binary
quantization. BSQ is (1) parameter-efficient without an explicit codebook, (2)
scalable to arbitrary token dimensions, and (3) compact: compressing visual
data by up to 100× with minimal distortion. Our tokenizer uses a
transformer encoder and decoder with simple block-wise causal masking to
support variable-length videos as input. The resulting BSQ-ViT achieves
state-of-the-art visual reconstruction quality on image and video
reconstruction benchmarks with 2.4× throughput compared to the best
prior methods. Furthermore, by learning an autoregressive prior for adaptive
arithmetic coding, BSQ-ViT achieves video compression results comparable to
state-of-the-art video compression standards. BSQ-ViT also enables masked
language models to achieve image synthesis quality competitive with GAN- and
diffusion-based methods.
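
The quantization step described above (project to a lower-dimensional hypersphere, then binarize) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `bsq_quantize`, the random projection matrix `weight`, and the toy sizes `d` and `L` are all assumptions made for illustration, and training details such as gradient estimation are omitted. The sketch only shows why no explicit codebook is needed: the sign pattern of the normalized vector is itself the token.

```python
import math
import random


def bsq_quantize(z, weight):
    """Hypothetical sketch: project z, normalize onto the unit sphere, binarize.

    z      : list of d floats (a visual embedding)
    weight : L x d projection matrix as a list of rows (assumed, not learned here)
    Returns the quantized unit vector and its integer token index.
    """
    # Linear projection from d dimensions down to L dimensions.
    v = [sum(w * x for w, x in zip(row, z)) for row in weight]
    # Normalize onto the unit hypersphere (guard against an all-zero vector).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    u = [x / norm for x in v]
    # Binary quantization: keep only the sign of each coordinate.
    bits = [1 if x >= 0 else 0 for x in u]
    # Rescale so the quantized vector also lies on the unit hypersphere.
    scale = 1.0 / math.sqrt(len(u))
    u_hat = [scale if b else -scale for b in bits]
    # The L sign bits form the token index directly; no codebook lookup.
    index = sum(b << i for i, b in enumerate(bits))
    return u_hat, index


random.seed(0)
d, L = 16, 4  # toy sizes chosen for the example
weight = [[random.gauss(0, 1) for _ in range(d)] for _ in range(L)]
z = [random.gauss(0, 1) for _ in range(d)]
u_hat, index = bsq_quantize(z, weight)
```

With `L` bits per token, the implicit vocabulary has 2^L entries, so the token dimension can grow without storing or searching a codebook of that size.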