Temporal Shift Module with Pretrained Representations for Speech Emotion Recognition

Intelligent Computing (2024)

Abstract
Recent advances in self-supervised models have led to effective pretrained speech representations for downstream speech emotion recognition tasks. However, previous research has primarily focused on exploiting pretrained representations by simply adding a linear head on top of the pretrained model, while overlooking the design of the downstream network. In this paper, we propose a temporal shift module with pretrained representations to integrate channel-wise information without introducing additional parameters or floating-point operations. By incorporating the temporal shift module, we developed corresponding shift variants of 3 baseline building blocks: ShiftCNN, ShiftLSTM, and Shiftformer. Furthermore, we propose 2 technical strategies, placement and proportion of shift, to balance the trade-off between mingling and misalignment. Our family of temporal shift models outperforms state-of-the-art methods on the benchmark Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset in both fine-tuning and feature-extraction scenarios. In addition, through comprehensive experiments using wav2vec 2.0 and Hidden-Unit Bidirectional Encoder Representations from Transformers (HuBERT) representations, we identified the behavior of the temporal shift module in downstream models, which may serve as an empirical guideline for future exploration of channel-wise shift and downstream network design.
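For illustration, below is a minimal sketch of a channel-wise temporal shift applied to frame-level pretrained features of shape (batch, time, channels). The function name temporal_shift, the default shift proportion, and the even forward/backward split are illustrative assumptions, not the paper's exact configuration; the key property is that the operation only moves existing channel values along the time axis, so it adds no parameters and no extra floating-point operations.

```python
# Hypothetical sketch of a channel-wise temporal shift in the spirit of the
# module described in the abstract; the 25% shift proportion and the
# forward/backward split are illustrative assumptions.
import torch


def temporal_shift(x: torch.Tensor, shift_proportion: float = 0.25) -> torch.Tensor:
    """Shift a proportion of channels along the time axis.

    x: pretrained representations of shape (batch, time, channels),
       e.g. frame-level wav2vec 2.0 or HuBERT features.
    shift_proportion: fraction of channels moved in time; half of them are
       shifted one step forward, half one step backward, and the remaining
       channels are left in place.
    """
    batch, time, channels = x.shape
    fold = int(channels * shift_proportion / 2)

    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                   # shift forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]   # shift backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # keep the rest unshifted
    return out


# Example: shift features before a downstream block (e.g. a ShiftCNN-style layer).
features = torch.randn(4, 100, 768)  # (batch, frames, hidden size of wav2vec 2.0)
shifted = temporal_shift(features, shift_proportion=0.25)
assert shifted.shape == features.shape
```

In this sketch, the proportion of shifted channels and where the shift is placed relative to the downstream block correspond to the two strategies the paper proposes for balancing mingling against misalignment.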