Combining pretrained speech and text encoders for spoken language processing

ICLR 2023 (2023)

Abstract
Spoken language processing tasks that extract information from the speech signal can benefit from using both the speech and text modalities. In this paper, we propose to combine pretrained speech and text encoders via cross-attention, and we demonstrate the proposed architecture in multiple spoken language processing systems. Our results indicate that it is more efficient to repurpose previously trained, independent modality encoders and learn only the cross-attention from scratch. The resulting architecture captures both acoustic and lexical information, and performs text tagging while attending to the speech encoder for improved results. We use compact pretrained speech and text encoders that are resource-efficient and can be trained on a single consumer GPU card.
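The abstract describes the architecture only at a high level; the PyTorch snippet below is a minimal sketch of the idea, not the authors' implementation. It assumes two frozen pretrained encoder modules (`text_encoder`, `speech_encoder`) that each return hidden states of a shared dimension; the module names, hidden size, and tagging head are hypothetical placeholders. Only the cross-attention layer and the tag classifier are trained from scratch, mirroring the claim that repurposing independent modality encoders is more efficient than joint pretraining.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Sketch: fuse frozen pretrained text and speech encoders with a
    cross-attention layer learned from scratch, then tag text tokens."""

    def __init__(self, text_encoder, speech_encoder,
                 dim=256, num_heads=4, num_tags=10):
        super().__init__()
        self.text_encoder = text_encoder      # pretrained, kept frozen
        self.speech_encoder = speech_encoder  # pretrained, kept frozen
        for p in self.text_encoder.parameters():
            p.requires_grad = False
        for p in self.speech_encoder.parameters():
            p.requires_grad = False
        # Only these layers receive gradient updates.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads,
                                                batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.tagger = nn.Linear(dim, num_tags)  # per-token tag logits

    def forward(self, token_ids, speech_features):
        text_states = self.text_encoder(token_ids)            # (B, T_text, dim)
        speech_states = self.speech_encoder(speech_features)  # (B, T_speech, dim)
        # Text tokens act as queries; speech frames supply keys and values,
        # so each token can pull in the acoustic evidence it needs.
        attended, _ = self.cross_attn(query=text_states,
                                      key=speech_states,
                                      value=speech_states)
        fused = self.norm(text_states + attended)  # residual + layer norm
        return self.tagger(fused)                  # tag each text token
```

Using the text states as queries means the tagger still operates over the text sequence, so standard sequence-labeling losses apply unchanged, while the cross-attention injects acoustic information the lexical transcript alone would miss.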
Keywords
Spoken language processing, Multi-modal SLU, Encoder fusion