ViTamin: Designing Scalable Vision Models in the Vision-Language Era
arXiv (2024)
Abstract
Recent breakthroughs in vision-language models (VLMs) start a new page in the
vision community. The VLMs provide stronger and more generalizable feature
embeddings compared to those from ImageNet-pretrained models, thanks to training on large-scale Internet image-text pairs. However, despite the remarkable achievements of VLMs, vanilla Vision Transformers (ViTs) remain
the default choice for the image encoder. Although the pure Transformer has proven its effectiveness for text encoding, it remains questionable whether the same holds for image encoding, especially considering that various types of networks have been proposed on the ImageNet benchmark but, unfortunately, are rarely studied in VLMs. Because of the small data and model scale, the original conclusions about model design drawn on ImageNet can be limited and biased. In this paper, we aim at
building an evaluation protocol of vision models in the vision-language era
under the contrastive language-image pretraining (CLIP) framework. We provide a
comprehensive way to benchmark different vision models, covering their
zero-shot performance and scalability in both model and training data sizes. To
this end, we introduce ViTamin, a new family of vision models tailored for VLMs.
ViTamin-L significantly outperforms ViT-L by 2.0% ImageNet zero-shot accuracy
when using the same publicly available DataComp-1B dataset and the same
OpenCLIP training scheme. ViTamin-L presents promising results on 60 diverse
benchmarks, including classification, retrieval, open-vocabulary detection and
segmentation, and large multi-modal models. When further scaling up the model
size, our ViTamin-XL with only 436M parameters attains 82.9% ImageNet zero-shot accuracy, surpassing the 82.0% achieved by EVA-E, which has ten times more parameters (4.4B).
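The zero-shot results above follow the standard CLIP recipe: encode a set of class-name prompts with the text encoder, encode the image with the vision encoder, and predict the class whose text embedding is most similar. The sketch below illustrates this recipe with the OpenCLIP library; the model name, pretrained tag, labels, and image path are illustrative assumptions, and a released ViTamin checkpoint can be substituted where one is registered in your open_clip version.

```python
# Minimal sketch of CLIP-style zero-shot classification with OpenCLIP.
# Model/tag names below are assumptions for illustration; swap in a
# ViTamin checkpoint where available.
import torch
import open_clip
from PIL import Image

# Load a CLIP model trained on DataComp-1B (ViT-L-14 as a stand-in).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="datacomp_xl_s13b_b90k")
tokenizer = open_clip.get_tokenizer("ViT-L-14")
model.eval()

# Encode class-name prompts and a query image.
class_names = ["dog", "cat", "car"]                         # illustrative labels
prompts = tokenizer([f"a photo of a {c}" for c in class_names])
image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # hypothetical file

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(prompts)
    # Normalize and score by cosine similarity; softmax gives per-class probabilities.
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

print(dict(zip(class_names, probs[0].tolist())))
```

Benchmarking different vision backbones under this protocol amounts to swapping the image encoder while keeping the text encoder, training data, and evaluation prompts fixed.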