Speech Wikimedia: A 77 Language Multilingual Speech Dataset

Rafael Mosquera Gómez, Julián Eusse, Juan Ciro, Daniel Galvez, Ryan Hileman, Kurt Bollacker, David Kanter

CoRR (2023)

Abstract
The Speech Wikimedia Dataset is a publicly available compilation of audio with transcriptions extracted from Wikimedia Commons. It includes 1780 hours (195 GB) of CC-BY-SA licensed transcribed speech from a diverse set of scenarios and speakers, in 77 different languages. Each audio file has one or more transcriptions in different languages, making this dataset suitable for training speech recognition, speech translation, and machine translation models.
Keywords
language, dataset