MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens
arXiv (2023)
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated a profound
capability in multimodal understanding. However, the simultaneous
generation of images with coherent texts is still underdeveloped. Addressing
this, we introduce a novel interleaved vision-and-language generation method,
centered around the concept of “generative vokens”. These vokens serve as
pivotal elements contributing to coherent image-text outputs. Our method is
marked by a unique two-stage training strategy for description-free multimodal
generation, which does not necessitate extensive descriptions of images. We
integrate classifier-free guidance to enhance the alignment of generated images
and texts, ensuring more seamless and contextually relevant multimodal
interactions. Our model, MiniGPT-5, exhibits substantial improvement over the
baseline models on multimodal generation datasets, including MMDialog and VIST.
Human evaluation shows that MiniGPT-5 is better than the baseline model in more
than 56% of cases for multimodal generation, highlighting its efficacy across
diverse benchmarks.