On Generative Spoken Language Modeling from Raw Audio

Kushal Lakhotia, Evgeny Kharitonov, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Benjamin Bolte, Tu-Anh Nguyen, Jade Copet, Alexei Baevski, Abdelrahman Mohamed, Emmanuel Dupoux

Transactions of the Association for Computational Linguistics (2021)


Abstract

We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), together with a set of metrics that automatically evaluate the learned representations at the acoustic and linguistic levels for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo-text), and a speech decoder (generating a waveform from pseudo-text), all trained without supervision, and we validate the proposed metrics with human evaluation. Across three speech encoders (CPC, wav2vec 2.0, HuBERT), we find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way, and that some combinations approach text-based systems.
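
As a rough illustration of the encoder-to-language-model part of the pipeline described above, the toy sketch below maps feature frames to discrete units and fits a count-based model on the resulting pseudo-text. Everything in it is an assumption made for illustration: the nearest-centroid quantizer, the merging of repeated units, the bigram model, and all function names are hypothetical stand-ins, not the encoders, language model, or decoder used in the paper. The speech decoder stage is omitted entirely.

```python
import random
from collections import Counter, defaultdict


def quantize(frames, centroids):
    """Map each feature frame to the index of its nearest centroid (a discrete unit)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(centroids)), key=lambda k: sqdist(f, centroids[k]))
            for f in frames]


def collapse_repeats(units):
    """Merge consecutive duplicate units (an assumption, not stated in the abstract)."""
    out = []
    for u in units:
        if not out or out[-1] != u:
            out.append(u)
    return out


def train_bigram(sequences):
    """Count unit-to-unit transitions; a toy stand-in for a generative language model."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def sample(counts, start, length, rng):
    """Sample a unit sequence by following transition counts from `start`."""
    seq = [start]
    while len(seq) < length:
        followers = counts.get(seq[-1])
        if not followers:
            break  # no observed continuation for this unit
        units, weights = zip(*followers.items())
        seq.append(rng.choices(units, weights=weights)[0])
    return seq


# Toy 2-D "features" and 3 centroids (the paper studies 50, 100, or 200 units).
centroids = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
frames = [(0.1, 0.0), (0.0, 0.1), (0.9, 1.1), (1.0, 0.9), (2.1, 0.1), (0.0, -0.1)]

units = quantize(frames, centroids)
pseudo_text = collapse_repeats(units)
lm = train_bigram([pseudo_text])
generated = sample(lm, start=pseudo_text[0], length=6, rng=random.Random(0))
print(units, pseudo_text, generated)
```

In a real system the frames would come from a learned speech encoder and the generated units would be passed to a decoder that synthesizes a waveform; the sketch only shows how discrete units can act as text for a language model.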
