SPARO: Selective Attention for Robust and Compositional Transformer Encodings for Vision
arXiv (2024)
Abstract
Selective attention helps us focus on task-relevant aspects in the constant
flood of our sensory input. This constraint in our perception allows us to
robustly generalize under distractions and to new compositions of perceivable
concepts. Transformers employ a similar notion of attention in their
architecture, but representation learning models with transformer backbones
like CLIP and DINO often fail to demonstrate robustness and compositionality.
We highlight a missing architectural prior: unlike human perception,
transformer encodings do not separately attend over individual concepts. In
response, we propose SPARO, a read-out mechanism that partitions encodings into
separately-attended slots, each produced by a single attention head. Using
SPARO with CLIP imparts an inductive bias that the vision and text modalities
are different views of a shared compositional world with the same corresponding
concepts. Using SPARO, we demonstrate improvements on downstream recognition,
robustness, retrieval, and compositionality benchmarks with CLIP (up to +14%
for ImageNet, +4% for SugarCrepe), and on nearest neighbors and linear probe
of ImageNet with DINO (+3% each). We also show that we can
intervene and select individual SPARO concepts to further improve downstream
task performance (up from +4% to +9% for SugarCrepe), and use this ability to
study the robustness of SPARO's representation structure. Finally, we provide
insights through ablation experiments and visualization of learned concepts.
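The read-out described above (encodings partitioned into separately-attended slots, each produced by a single attention head) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the per-slot learned queries and the shared key/value projections (`queries`, `W_k`, `W_v`) are assumptions for the sake of the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sparo_readout(tokens, queries, W_k, W_v):
    """Hypothetical sketch of a SPARO-style read-out.

    tokens : (n, d)    backbone token encodings
    queries: (L, d_k)  one learned query per slot (single head each; assumed)
    W_k    : (d, d_k)  key projection (assumed shared across slots)
    W_v    : (d, d_s)  value projection (assumed)
    Returns (L, d_s): L separately-attended slots, which a downstream model
    could concatenate into one encoding.
    """
    K = tokens @ W_k                               # (n, d_k)
    V = tokens @ W_v                               # (n, d_s)
    scores = queries @ K.T / np.sqrt(K.shape[1])   # (L, n) scaled dot-product
    attn = softmax(scores, axis=-1)                # each slot attends on its own
    return attn @ V                                # (L, d_s)
```

Each row of `attn` is an independent attention distribution over the input tokens, so every slot can focus on a different concept in the encoding.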