Crossmodal Clustered Contrastive Learning: Grounding of Spoken Language to Gesture

Multimodal Interfaces and Machine Learning for Multimodal Interaction (2021)

Abstract
Crossmodal grounding is a key technical challenge when generating relevant and well-timed gestures from spoken language. The same gesture can often accompany semantically different spoken language phrases, which makes crossmodal grounding especially challenging. For example, a gesture (semi-circular, with both hands) could co-occur with the semantically different phrases "entire bottom row" (referring to a physical location) and "molecules expand and decay" (referring to a scientific phenomenon). In this paper, we introduce a self-supervised approach to learn representations better suited to such many-to-one grounding relationships between spoken language and gestures. As part of this approach, we propose a new contrastive loss function, Crossmodal Cluster NCE, that guides the model to learn spoken language representations consistent with the similarities in the gesture space. This gesture-aware space can help us generate more relevant gestures given language as input. We demonstrate the effectiveness of our approach on a publicly available dataset through quantitative and qualitative evaluations. Our proposed methodology significantly outperforms prior approaches for gesture-language grounding. Link to code: https://github.com/dondongwon/CC_NCE_GENEA.
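
The linked repository contains the authors' implementation. As a rough illustration of the core idea only, the sketch below shows one way a cluster-based NCE objective could be written in PyTorch: language embeddings whose accompanying gestures fall in the same cluster are treated as positives of one another. The function name crossmodal_cluster_nce, the tensor shapes, the temperature default, and the use of precomputed (e.g., k-means) gesture cluster ids are assumptions for illustration, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def crossmodal_cluster_nce(lang_emb, gesture_clusters, temperature=0.1):
    # lang_emb:         (B, D) language embeddings for one batch.
    # gesture_clusters: (B,)   cluster id of each sample's gesture,
    #                   e.g., from k-means over gesture features (assumed here).
    # Samples whose gestures share a cluster are treated as positives, so
    # semantically different phrases that co-occur with similar gestures
    # are pulled together in the language representation space.
    z = F.normalize(lang_emb, dim=-1)
    sim = z @ z.t() / temperature                       # (B, B) scaled cosine similarity
    self_mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    pos = gesture_clusters.unsqueeze(0) == gesture_clusters.unsqueeze(1)
    pos = pos & ~self_mask                              # positives: same cluster, not self
    logits = sim.masked_fill(self_mask, float('-inf'))  # drop self-pairs from the denominator
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    has_pos = pos.any(dim=1)                            # skip anchors with no in-batch positive
    loss = -(log_prob[has_pos] * pos[has_pos]).sum(1) / pos[has_pos].sum(1)
    return loss.mean()

Treating same-cluster pairs as positives is what allows the many-to-one relationship described above: two phrases such as "entire bottom row" and "molecules expand and decay" can be drawn toward each other in the language space whenever their accompanying gestures are similar enough to share a cluster.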