TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space

Daniel Garibi, Shahar Yadin, Roni Paiss, Omer Tov, Shiran Zada, Ariel Ephrat, Tomer Michaeli, Inbar Mosseri, Tali Dekel

CoRR (2025)


Abstract

We present TokenVerse, a method for multi-concept personalization that leverages a pre-trained text-to-image diffusion model. Our framework can disentangle complex visual elements and attributes from as little as a single image, while enabling seamless plug-and-play generation of combinations of concepts extracted from multiple images. As opposed to existing works, TokenVerse can handle multiple images with multiple concepts each, and supports a wide range of concepts, including objects, accessories, materials, pose, and lighting. Our work exploits a DiT-based text-to-image model, in which the input text affects the generation through both attention and modulation (shift and scale). We observe that the modulation space is semantic and enables localized control over complex concepts. Building on this insight, we devise an optimization-based framework that takes as input an image and a text description, and finds for each word a distinct direction in the modulation space. These directions can then be used to generate new images that combine the learned concepts in a desired configuration. We demonstrate the effectiveness of TokenVerse in challenging personalization settings, and showcase its advantages over existing methods. Project webpage: https://token-verse.github.io/
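
The idea of a per-word direction in the modulation space can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the linear maps `w_shift` and `w_scale`, the `(1 + scale)` form of the modulation, and all dimensions are assumptions chosen for illustration, and the optimization that would actually learn the directions from an image and its caption is not shown. The sketch only shows how adding a direction to one token's modulation vector changes the shift and scale applied to that token while leaving the others untouched.

```python
import numpy as np

def modulate(tokens, mod_vectors, w_shift, w_scale):
    """Apply shift-and-scale modulation to each token separately.

    tokens:      (n_tokens, d) token features
    mod_vectors: (n_tokens, m) one modulation vector per token
    w_shift, w_scale: (m, d) hypothetical linear maps (assumed, not from the paper)
    """
    shift = mod_vectors @ w_shift
    scale = mod_vectors @ w_scale
    return tokens * (1.0 + scale) + shift

def per_token_modulation(base_vector, directions, n_tokens):
    """Copy one shared modulation vector to every token, then add a
    direction to selected tokens (token index -> direction vector)."""
    mod = np.tile(base_vector, (n_tokens, 1))
    for idx, delta in directions.items():
        mod[idx] += delta
    return mod

rng = np.random.default_rng(0)
n_tokens, d, m = 4, 8, 6                     # toy sizes, chosen arbitrarily
tokens = rng.normal(size=(n_tokens, d))
base = rng.normal(size=m)                    # stands in for the shared modulation vector
w_shift = rng.normal(size=(m, d))
w_scale = rng.normal(size=(m, d)) * 0.1

# A random stand-in for a learned direction attached to the word at index 2.
directions = {2: rng.normal(size=m)}

shared = modulate(tokens, per_token_modulation(base, {}, n_tokens), w_shift, w_scale)
personalized = modulate(
    tokens, per_token_modulation(base, directions, n_tokens), w_shift, w_scale
)

# Only the token that received a direction is modulated differently.
changed = np.any(~np.isclose(shared, personalized), axis=1)
print(changed)
```

In this toy setting, combining concepts from several images would amount to passing several entries in `directions`, one per word whose concept should appear in the generated image.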



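[Key points]: The paper proposes TokenVerse, a method that uses a pre-trained text-to-image diffusion model for multi-concept personalization. It can disentangle complex visual elements and attributes from a single image and supports generating combinations of concepts drawn from multiple images.

[Method]: TokenVerse builds on a DiT-based text-to-image model in which the input text affects generation through both attention and modulation (shift and scale). Its optimization framework takes an image and a text description and finds a distinct direction in the modulation space for each word; these directions are then used to generate new images that combine the learned concepts.

[Experiments]: The authors demonstrate TokenVerse's effectiveness in several challenging personalization settings and showcase its advantages over existing methods on the project webpage (https://token-verse.github.io/). The abstract does not name the datasets used.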
