A Manifold Representation of the Key in Vision Transformers
CoRR (2024)
Abstract
Vision Transformers implement multi-head self-attention (MSA) via stacking
multiple attention blocks. The query, key, and value are often intertwined and
generated within those blocks via a single, shared linear transformation. This
paper explores the concept of disentangling the key from the query and value,
and adopting a manifold representation for the key. Our experiments reveal that
decoupling and endowing the key with a manifold structure can enhance the model
performance. Specifically, ViT-B exhibits a 0.87% gain, while Swin-T sees a
boost of 0.52% on the dataset, with eight charts in the manifold key. Our
approach also yields
positive results in object detection and instance segmentation tasks on the
COCO dataset. Through detailed ablation studies, we establish that these
performance gains are not merely due to the simplicity of adding more
parameters and computations. Future research may investigate strategies for
cutting the budget of such representations and aim for further performance
improvements based on our findings.
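To make the idea concrete, here is a minimal NumPy sketch of single-head attention in which the key is decoupled from the query and value and built from several chart-specific projections blended by a learned gate. The gating design, the function names, and all shapes are assumptions for illustration; the abstract above does not specify how the charts are combined.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def manifold_key_attention(x, wq, wv, chart_ws, gate_w):
    """Single-head attention with a decoupled, multi-chart key (hypothetical).

    x:        (n, d) token embeddings.
    wq, wv:   (d, d) query and value projections (separate from the key).
    chart_ws: list of C (d, d) per-chart key projections.
    gate_w:   (d, C) gating weights that mix the charts per token.
    """
    q = x @ wq
    v = x @ wv
    # Each chart proposes a candidate key for every token.
    charts = np.stack([x @ w for w in chart_ws], axis=0)   # (C, n, d)
    # A softmax gate blends the chart outputs token by token.
    gates = softmax(x @ gate_w, axis=-1)                   # (n, C)
    k = np.einsum("nc,cnd->nd", gates, charts)             # (n, d) blended key
    # Standard scaled dot-product attention with the blended key.
    attn = softmax(q @ k.T / np.sqrt(x.shape[1]), axis=-1)
    return attn @ v


# Example with eight charts, mirroring the configuration mentioned above.
rng = np.random.default_rng(0)
n, d, num_charts = 3, 4, 8
x = rng.normal(size=(n, d))
wq, wv = rng.normal(size=(d, d)), rng.normal(size=(d, d))
chart_ws = [rng.normal(size=(d, d)) for _ in range(num_charts)]
gate_w = rng.normal(size=(d, num_charts))
out = manifold_key_attention(x, wq, wv, chart_ws, gate_w)
```

Note that only the key path is multi-chart here; the query and value keep their own single linear projections, which is the disentanglement the abstract describes.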