ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model
CoRR (2024)
Abstract
Recent advancements in Large Vision-Language Models (LVLMs) have enabled
language models to process multimodal inputs, but deploying them demands
significant computational resources, especially on edge devices. This study
aims to bridge the performance gap between traditional-scale LVLMs and
resource-friendly lite versions by adopting high-quality training data. To this
end, we create a synthetic dataset by leveraging GPT-4V's ability to generate
detailed captions, complex reasoning instructions, and detailed answers from
images. The resulting model trained on our data, ALLaVA, achieves competitive
performance on 12 benchmarks among LVLMs of up to 3B parameters. This work
highlights the feasibility of adopting high-quality data to craft more
efficient LVLMs. Our online demo is available at .
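The data-synthesis step described in the abstract (prompting GPT-4V with an image to produce a detailed caption, a complex reasoning question, and a detailed answer) can be sketched as below. This is a minimal illustration assuming the OpenAI Python SDK; the model identifier, the prompt wording, and the `synthesize_vqa` helper are hypothetical stand-ins, not the authors' released pipeline.

```python
# Hypothetical sketch of GPT-4V-based data synthesis in the spirit of ALLaVA.
# Prompt wording, model name, and output format are assumptions made for
# illustration, not the paper's actual pipeline.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def synthesize_vqa(image_path: str) -> str:
    """Ask GPT-4V for a detailed caption, one complex reasoning question,
    and a detailed answer for a single image (hypothetical prompt)."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    prompt = (
        "First write a detailed caption for this image. Then pose one "
        "complex reasoning question about it, and give a detailed answer."
    )
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed model identifier
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        max_tokens=1024,
    )
    return resp.choices[0].message.content


if __name__ == "__main__":
    print(synthesize_vqa("example.jpg"))
```

Running such a loop over a large image corpus would yield caption/instruction/answer triples of the kind the abstract describes as training data for the lite model.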