T-CLAP: Temporal-Enhanced Contrastive Language-Audio Pretraining
arXiv (2024)
Abstract
Contrastive language-audio pretraining (CLAP) has been developed to align the
representations of audio and language, achieving remarkable performance in
retrieval and classification tasks. However, current CLAP models struggle to
capture temporal information within audio and text features, a substantial
limitation for tasks such as audio retrieval and generation. To address this
gap, we introduce T-CLAP, a temporal-enhanced CLAP model. We use Large Language
Models (LLMs) and mixed-up strategies to generate temporal-contrastive captions
for audio clips from extensive audio-text datasets. Subsequently, a new
temporal-focused contrastive loss is designed to fine-tune the CLAP model by
incorporating these synthetic data. We conduct comprehensive experiments and
analysis in multiple downstream tasks. T-CLAP shows improved capability in
capturing the temporal relationship of sound events and outperforms
state-of-the-art models by a significant margin.
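The abstract describes a temporal-focused contrastive loss that uses temporally contrastive captions (e.g., captions with the order of sound events swapped) as hard negatives alongside the standard in-batch negatives. The paper's exact formulation is not given here, so the following is only a minimal sketch of that idea under assumed shapes: a symmetric CLAP-style loss where each audio clip additionally contrasts against the embedding of its order-swapped caption. The function name, arguments, and temperature value are illustrative assumptions.

```python
import numpy as np

def log_softmax(x):
    # Numerically stable row-wise log-softmax.
    m = x.max(axis=1, keepdims=True)
    return x - m - np.log(np.exp(x - m).sum(axis=1, keepdims=True))

def temporal_contrastive_loss(audio_emb, text_emb, neg_text_emb, temperature=0.07):
    """Sketch of a temporal-focused contrastive loss (assumed form, not the
    paper's exact loss).

    audio_emb, text_emb: (N, D) L2-normalized embeddings of matched pairs.
    neg_text_emb: (N, D) embeddings of temporally swapped captions, used as
    an extra hard negative per audio clip.
    """
    n = audio_emb.shape[0]
    labels = np.arange(n)

    # Standard in-batch similarity matrix; matched pairs on the diagonal.
    logits = audio_emb @ text_emb.T / temperature                      # (N, N)
    # Similarity of each audio clip to its own order-swapped caption.
    neg_logits = (audio_emb * neg_text_emb).sum(1, keepdims=True) / temperature  # (N, 1)

    # Audio-to-text: softmax over in-batch texts plus the temporal negative.
    a2t = np.concatenate([logits, neg_logits], axis=1)                 # (N, N+1)
    loss_a2t = -log_softmax(a2t)[labels, labels]

    # Text-to-audio: standard symmetric direction.
    loss_t2a = -log_softmax(logits.T)[labels, labels]

    return 0.5 * (loss_a2t.mean() + loss_t2a.mean())
```

Minimizing this loss pushes each audio embedding toward its correctly ordered caption and away from both other captions in the batch and the swapped-order version of its own caption, which is what forces the model to encode event order rather than just event identity.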