SADA: Saudi Audio Dataset for Arabic

Sadeen Alharbi, Areeb Alowisheq, Zoltán Tüske, Kareem Darwish, Abdullah Alrajeh, Abdulmajeed Alrowithi, Aljawharah Bin Tamran, Asma Ibrahim, Raghad Aloraini, Raneem Alnajim, Ranya Alkahtani, Renad Almuasaad, Sara Alrasheed, Shaykhah Alsubaie, Yaser Alonaizan

ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2024)

Abstract

Arabic is among the most challenging languages in the world. Unfortunately, the scarcity of Arabic datasets makes research in Arabic speech technology difficult. This paper introduces SADA, the Saudi Audio Dataset for Arabic, with 668 hours of high-quality audio suitable for supervised training. The audio recordings were sourced from 57 television shows provided by the Saudi Broadcasting Authority. The audio covers both read and spontaneous speaking styles in various genres. The National Center for Artificial Intelligence in Saudi Arabia transcribed and prepared the data for training and processing. The recordings are in Arabic. Most are in Saudi dialects; the other Arabic dialects represented include Yemeni, Egyptian, and Levantine. The dataset is split into training, validation, and testing sets to facilitate its use. The validation and testing sets contain 10 hours of audio segments each. Besides a detailed description of the dataset, a wide range of speech recognition experiments using standard tools is also presented.

Keywords

Arabic dataset, dialectal Arabic data
