
Generating Co-Speech Gestures for Virtual Agents from Multimodal Information Based on Transformer.

Yue Yu, Jiande Shi

VRW (2023)

Abstract
To generate co-speech gestures for virtual agents and strengthen the correlation between gestures and the input modalities, we propose a Transformer-based model that encodes four modalities of information (Audio Waveform, Mel-Spectrogram, Text, and Speaker IDs). For the Mel-Spectrogram modality, we design a Mel-Spectrogram encoder based on the Swin Transformer pre-trained model to extract audio spectrum features hierarchically. For the Text modality, we use a Transformer encoder to extract text features aligned with the audio. We evaluate our model on the TED-Gesture dataset. Compared with state-of-the-art methods, we improve the mean absolute joint error by 2.33%, the mean acceleration difference by 15.01%, and the Fréchet gesture distance by 59.32%.
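
The following is a minimal PyTorch sketch of the four-modal fusion idea described in the abstract. All layer choices, dimensions, and the concatenate-then-decode fusion strategy are assumptions made for illustration, and the mel-spectrogram branch is a small stand-in Transformer rather than the pre-trained Swin Transformer encoder the paper uses; it is not the authors' implementation.

# Hypothetical four-modal gesture generator; names and sizes are illustrative only.
import torch
import torch.nn as nn

class FourModalGestureGenerator(nn.Module):
    def __init__(self, d_model=256, n_joints=10, seq_len=34,
                 vocab_size=30000, n_speakers=1000, n_mels=80):
        super().__init__()
        # Audio-waveform branch: 1-D conv downsampling (assumed).
        self.wave_enc = nn.Sequential(
            nn.Conv1d(1, d_model, kernel_size=15, stride=8, padding=7),
            nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=15, stride=8, padding=7),
        )
        # Mel-spectrogram branch: stand-in for the pre-trained Swin Transformer
        # encoder described in the abstract (a real run would load Swin weights).
        self.mel_proj = nn.Linear(n_mels, d_model)
        self.mel_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        # Text branch: Transformer encoder over word embeddings aligned to audio.
        self.txt_emb = nn.Embedding(vocab_size, d_model)
        self.txt_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        # Speaker-identity embedding.
        self.spk_emb = nn.Embedding(n_speakers, d_model)
        # Fusion + pose decoder: predict a sequence of 3-D joint positions.
        self.decoder = nn.GRU(4 * d_model, d_model, batch_first=True)
        self.out = nn.Linear(d_model, n_joints * 3)
        self.seq_len = seq_len

    def forward(self, wave, mel, text_ids, speaker_id):
        B, T = wave.size(0), self.seq_len
        w = self.wave_enc(wave.unsqueeze(1)).transpose(1, 2)   # (B, Tw, D)
        m = self.mel_enc(self.mel_proj(mel))                   # (B, Tm, D)
        t = self.txt_enc(self.txt_emb(text_ids))               # (B, Tt, D)
        s = self.spk_emb(speaker_id)                           # (B, D)

        # Resample each stream to the target pose length, then concatenate.
        def to_len(x):
            return nn.functional.interpolate(
                x.transpose(1, 2), size=T, mode='linear',
                align_corners=False).transpose(1, 2)

        fused = torch.cat([to_len(w), to_len(m), to_len(t),
                           s.unsqueeze(1).expand(-1, T, -1)], dim=-1)
        h, _ = self.decoder(fused)
        return self.out(h).view(B, T, -1, 3)                   # (B, T, J, 3)

# Toy forward pass with random inputs.
model = FourModalGestureGenerator()
poses = model(torch.randn(2, 16000), torch.randn(2, 100, 80),
              torch.randint(0, 30000, (2, 20)), torch.randint(0, 1000, (2,)))
print(poses.shape)  # torch.Size([2, 34, 10, 3])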