Music Conditioned Generation for Human-Centric Video

IEEE SIGNAL PROCESSING LETTERS (2024)

Abstract
Music and human-centric video are two fundamental signals shared across languages. Correlation analysis between the two is already used in choreography and film accompaniment. This letter explores this correlation in a new task: generating human-centric video from a start-end image pair and a piece of transitional music. Existing human-centric generation methods are not well suited to this task because they require frame-wise pose as input or struggle with long-duration videos. Our key idea is to build a temporal generation framework dominated by a DDPM and assisted by a VAE and a GAN. It reduces the computational cost of music-image diffusion by exploiting the compact latent space of the VAE and the efficient image translation of the GAN. To produce videos that are both long and of high quality, the framework first generates small-scale keyframes and then synthesizes high-resolution video from them. To strengthen the frame-wise consistency of the human body, a frame-aligned correspondence map is adopted as intermediate supervision. Extensive experiments and comparisons with the state-of-the-art method demonstrate the rationality and effectiveness of this signal generation framework.
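The abstract only sketches the pipeline at a high level. The snippet below is a minimal, illustrative sketch (not the authors' implementation) of the kind of latent-diffusion stage it describes: keyframe latents produced by a VAE are denoised by a DDPM conditioned on music features and the start/end image latents. All module names, dimensions, the linear noise schedule, and the simple MLP denoiser are assumptions made for illustration.

```python
# Illustrative sketch of a music- and start/end-conditioned latent DDPM step.
# NOT the paper's code; sizes and the MLP denoiser are placeholder assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_DIM = 64      # assumed size of a VAE keyframe latent
MUSIC_DIM = 128      # assumed size of a music-clip embedding
NUM_KEYFRAMES = 8    # assumed number of small-scale keyframes
T_STEPS = 1000       # number of diffusion steps


class Denoiser(nn.Module):
    """Predicts the noise added to a sequence of keyframe latents,
    conditioned on music features and the start/end image latents."""

    def __init__(self):
        super().__init__()
        cond_dim = MUSIC_DIM + 2 * LATENT_DIM + 1  # music + start/end latents + timestep
        self.net = nn.Sequential(
            nn.Linear(NUM_KEYFRAMES * LATENT_DIM + cond_dim, 512),
            nn.SiLU(),
            nn.Linear(512, 512),
            nn.SiLU(),
            nn.Linear(512, NUM_KEYFRAMES * LATENT_DIM),
        )

    def forward(self, noisy_latents, t, music_emb, start_latent, end_latent):
        b = noisy_latents.size(0)
        cond = torch.cat(
            [music_emb, start_latent, end_latent, t.float().view(b, 1) / T_STEPS],
            dim=1,
        )
        x = torch.cat([noisy_latents.view(b, -1), cond], dim=1)
        return self.net(x).view(b, NUM_KEYFRAMES, LATENT_DIM)


# Standard DDPM noise schedule (linear betas).
betas = torch.linspace(1e-4, 0.02, T_STEPS)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)


def training_step(denoiser, keyframe_latents, music_emb, start_latent, end_latent):
    """One DDPM training step on VAE-encoded keyframe latents."""
    b = keyframe_latents.size(0)
    t = torch.randint(0, T_STEPS, (b,))
    noise = torch.randn_like(keyframe_latents)
    a_bar = alphas_cumprod[t].view(b, 1, 1)
    noisy = a_bar.sqrt() * keyframe_latents + (1 - a_bar).sqrt() * noise
    pred = denoiser(noisy, t, music_emb, start_latent, end_latent)
    return F.mse_loss(pred, noise)


if __name__ == "__main__":
    denoiser = Denoiser()
    # Dummy batch: in practice the latents would come from a pretrained VAE
    # encoder and music_emb from an audio feature extractor (both assumed here).
    latents = torch.randn(4, NUM_KEYFRAMES, LATENT_DIM)
    music = torch.randn(4, MUSIC_DIM)
    start, end = torch.randn(4, LATENT_DIM), torch.randn(4, LATENT_DIM)
    loss = training_step(denoiser, latents, music, start, end)
    loss.backward()
    print(f"diffusion loss: {loss.item():.4f}")
```

In this reading, the GAN-based image translation and the frame-aligned correspondence map mentioned in the abstract would sit downstream of this stage, upsampling the generated keyframe latents into a high-resolution, temporally consistent video.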
Keywords
Multiple signal classification, Generative adversarial networks, Correlation, Visualization, Training, Task analysis, Feature extraction, Video generation, signal processing, cross-modal learning, human-centric