ST-LDM: A Universal Framework for Text-Grounded Object Generation in Real Images
CoRR (2024)
Abstract
We present a novel image editing scenario termed Text-grounded Object
Generation (TOG), defined as generating a new object in a real image,
spatially conditioned on textual descriptions. Existing diffusion models
exhibit limited spatial perception in complex real-world scenes and rely on
additional modalities to enforce spatial constraints; TOG poses heightened
challenges for scene comprehension under the weak supervision of linguistic
information alone. We propose ST-LDM, a universal Swin-Transformer-based
framework that can be integrated into any latent diffusion model via
training-free backward guidance. ST-LDM comprises a global-perceptual
autoencoder with adaptable compression scales and hierarchical visual
features, operating in parallel with a deformable multimodal transformer that
generates region-wise guidance for the subsequent denoising process. We
overcome the limitation of traditional attention mechanisms, which attend
only to existing visual features, by introducing deformable feature alignment
that hierarchically refines spatial positioning through the fusion of
multi-scale visual and linguistic information. Extensive experiments
demonstrate that our model enhances the localization ability of attention
mechanisms while preserving the generative capabilities inherent to diffusion
models.
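
The abstract mentions training-free backward guidance, in which region-wise signals steer the denoising process without fine-tuning the diffusion model. The abstract gives no implementation details, so the following is only a minimal sketch of the general backward-guidance idea as it is commonly realized: an energy over cross-attention maps is differentiated with respect to the noisy latent, which is nudged before each denoising step so that attention mass for the object phrase concentrates inside the target region. The function name `backward_guidance_step`, the callable `attn_fn`, and the `scale` hyperparameter are hypothetical, not from the paper.

```python
import torch

def backward_guidance_step(latent, region_mask, attn_fn, scale=30.0):
    """One training-free backward-guidance update (sketch, not the paper's code).

    latent      : (1, C, H, W) noisy latent at the current timestep
    region_mask : (H, W) binary mask for the text-grounded target region
    attn_fn     : callable returning an (H, W) cross-attention map for the
                  object phrase, differentiable w.r.t. `latent` (hypothetical;
                  stands in for hooks into the denoiser's attention layers)
    scale       : guidance strength (assumed hyperparameter)
    """
    latent = latent.detach().requires_grad_(True)
    attn = attn_fn(latent)                                  # (H, W)
    inside = (attn * region_mask).sum()
    # Energy: fraction of attention mass falling outside the target region.
    loss = (1.0 - inside / attn.sum().clamp(min=1e-8)) ** 2
    grad = torch.autograd.grad(loss, latent)[0]
    # Nudge the latent against the gradient before the next denoising step.
    return (latent - scale * grad).detach()


# Toy check with a softmax "attention" over latent magnitudes (hypothetical).
lat = torch.randn(1, 4, 16, 16)
mask = torch.zeros(16, 16)
mask[4:12, 4:12] = 1.0
toy_attn = lambda z: torch.softmax(z.abs().mean(1).flatten(), -1).view(16, 16)
lat = backward_guidance_step(lat, mask, toy_attn)
```

In a full pipeline this update would run inside the sampling loop, between computing the model's noise prediction and the scheduler step; because only the latent is modified, the pretrained diffusion weights remain frozen, which matches the "training-free" property the abstract claims.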