Jointly Training Large Autoregressive Multimodal Models

Emanuele Aiello, Lili Yu, Yixin Nie, Armen Aghajanyan, Barlas Oguz

ICLR 2024 (2023)

Abstract

In recent years, advances in the large-scale pretraining of language and text-to-image models have revolutionized the field of machine learning. Yet, integrating these two modalities into a single, robust model capable of generating seamless multimodal outputs remains a significant challenge. To address this gap, we present the Joint Autoregressive Mixture (JAM) framework, a modular approach that systematically fuses existing text and image generation models. We also introduce a specialized, data-efficient instruction-tuning strategy, tailored for mixed-modal generation tasks. Our final instruct-tuned model demonstrates unparalleled performance in generating high-quality multimodal outputs and represents the first model explicitly designed for this purpose.

Keywords

Large Multimodal Models, Joint Training, Interleaved Image-Text Generation, Autoregressive Models
