Cross-stitched Multi-modal Encoders

Karan Singla, Daniel Pressel, Ryan Price, Bhargav Srinivas Chinnari, Yeon-Jun Kim, Srinivas Bangalore

arXiv (2022)

Abstract

In this paper, we propose a novel architecture for multi-modal speech and text input. We combine pretrained speech and text encoders using multi-headed cross-modal attention and jointly fine-tune them on the target problem. The resulting architecture can be used for continuous token-level classification or utterance-level prediction over simultaneous text and speech, and the resulting encoder efficiently captures both acoustic-prosodic and lexical information. We compare the benefits of multi-headed attention-based fusion for multi-modal utterance-level classification against a simple concatenation of pre-pooled, modality-specific representations. Our model architecture is compact and resource-efficient, and it can be trained on a single consumer GPU card.
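
The two fusion strategies contrasted in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not say which modality supplies the queries, what the dimensions are, or whether a residual connection is used. Here text tokens are assumed to query speech frames, both encoders are assumed to emit vectors of the same width, and random arrays stand in for the pretrained encoder outputs. The function names cross_modal_attention and concat_pooled are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(text, speech, w_q, w_k, w_v, w_o, num_heads):
    """Multi-headed attention in which text tokens attend over speech frames.

    text:   (t_len, d) token representations (assumed text-encoder output)
    speech: (s_len, d) frame representations (assumed speech-encoder output)
    Returns fused token representations (t_len, d) and the attention
    weights (num_heads, t_len, s_len).
    """
    t_len, d = text.shape
    s_len = speech.shape[0]
    head_dim = d // num_heads

    def split_heads(x, length):
        # (length, d) -> (num_heads, length, head_dim)
        return x.reshape(length, num_heads, head_dim).transpose(1, 0, 2)

    q = split_heads(text @ w_q, t_len)
    k = split_heads(speech @ w_k, s_len)
    v = split_heads(speech @ w_v, s_len)

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    weights = softmax(scores, axis=-1)
    context = weights @ v                       # (num_heads, t_len, head_dim)
    merged = context.transpose(1, 0, 2).reshape(t_len, d)
    # Residual connection: an assumption, not stated in the abstract.
    return text + merged @ w_o, weights

def concat_pooled(text, speech):
    # Baseline: mean-pool each modality separately, then concatenate.
    return np.concatenate([text.mean(axis=0), speech.mean(axis=0)])

rng = np.random.default_rng(0)
d, heads = 16, 4
text = rng.normal(size=(5, d))      # stand-in for 5 text tokens
speech = rng.normal(size=(40, d))   # stand-in for 40 speech frames
w_q, w_k, w_v, w_o = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(4))

fused, weights = cross_modal_attention(text, speech, w_q, w_k, w_v, w_o, heads)
token_level = fused                  # one vector per token, for token-level classification
utterance_level = fused.mean(axis=0) # pooled vector, for utterance-level prediction
baseline = concat_pooled(text, speech)
```

In this sketch the attention-based fusion keeps one fused vector per text token, so the same encoder output can feed either a token-level classifier or, after pooling, an utterance-level one. The concatenation baseline pools each modality first and therefore only supports utterance-level prediction.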

Keywords

encoders, cross-stitched, multi-modal
