Training-Free Consistent Text-to-Image Generation
CoRR (2024)
Abstract
Text-to-image models offer a new level of creative flexibility by allowing
users to guide the image generation process through natural language. However,
using these models to consistently portray the same subject across diverse
prompts remains challenging. Existing approaches fine-tune the model to teach
it new words that describe specific user-provided subjects or add image
conditioning to the model. These methods require lengthy per-subject
optimization or large-scale pre-training. Moreover, they struggle to align
generated images with text prompts and face difficulties in portraying multiple
subjects. Here, we present ConsiStory, a training-free approach that enables
consistent subject generation by sharing the internal activations of the
pretrained model. We introduce a subject-driven shared attention block and
correspondence-based feature injection to promote subject consistency between
images. Additionally, we develop strategies to encourage layout diversity while
maintaining subject consistency. We compare ConsiStory to a range of baselines,
and demonstrate state-of-the-art performance on subject consistency and text
alignment, without requiring a single optimization step. Finally, ConsiStory
can naturally extend to multi-subject scenarios, and even enable training-free
personalization for common objects.
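
To make the idea of sharing internal activations concrete, the following is a minimal NumPy sketch of a subject-masked shared self-attention step. It is an illustration under stated assumptions, not the authors' implementation: the function name `shared_self_attention`, the tensor layout, and the precomputed boolean `subject_mask` are all hypothetical, and how the masks are obtained is outside this sketch. Each image's queries attend to that image's own tokens plus the tokens of the other images in the batch that are marked as belonging to the subject.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; entries set to -inf get zero weight.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def shared_self_attention(q, k, v, subject_mask):
    """Hypothetical sketch of subject-masked shared attention.

    q, k, v:       arrays of shape (B, N, D), one row of tokens per image.
    subject_mask:  boolean array of shape (B, N); True marks subject tokens
                   (assumed to be given).
    Returns an array of shape (B, N, D).
    """
    B, N, D = q.shape
    # Pool keys and values from every image in the batch: (B*N, D).
    k_all = k.reshape(B * N, D)
    v_all = v.reshape(B * N, D)

    # allowed[i, j*N + n] is True if image i may attend to token n of image j.
    # Subject tokens of any image are visible to all images.
    allowed = np.broadcast_to(subject_mask.reshape(1, B * N), (B, B * N)).copy()
    for i in range(B):
        # An image always sees all of its own tokens.
        allowed[i, i * N:(i + 1) * N] = True

    logits = q @ k_all.T / np.sqrt(D)                      # (B, N, B*N)
    logits = np.where(allowed[:, None, :], logits, -np.inf)
    return softmax(logits, axis=-1) @ v_all                # (B, N, D)
```

With an all-False mask, this reduces to ordinary per-image self-attention, so the shared behaviour comes entirely from the subject tokens exposed across images. No weights are updated at any point, which is consistent with the training-free framing above.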