
Automatic Diacritization of Tunisian Dialect Text Using SMT Model

International Journal of Speech Technology (2021)

Unlike many other languages, Arabic is characterized by a written form that is essentially consonantal and may omit short vowels. One of the major functions of short vowels is to determine and clarify the meaning of words and sentences. However, Modern Standard Arabic (MSA) texts are generally written without vowels, which gives rise to a great deal of morphological, semantic, and syntactic ambiguity. This ambiguity problem is not limited to MSA: it also affects Arabic dialects in general and the Tunisian Dialect (TD) in particular. Compared to MSA, TD suffers from the unavailability of basic tools and linguistic resources, such as sufficiently large corpora, multilingual dictionaries, and morphological and syntactic analyzers. The lack of these resources makes processing this language a great challenge (Masmoudi et al., 2020). Despite the numerous efforts currently underway, some shortages persist in this field. We address this gap by investigating the automatic diacritization of TD texts. In this respect, we regard the diacritization problem as a simplified phrase-based Statistical Machine Translation (SMT) task, in which the source language is the undiacritized text and the target language is the diacritized text. We first describe the creation of the TD corpus in detail. Once finalized and approved, this corpus is used to build a diacritic restoration system for TD. The system, called TDTACHKIL, achieves a Word Error Rate (WER) of 16.7% and a Diacritic Error Rate (DER) of 8.89%.
Diacritization, Tunisian dialect, Modern Standard Arabic, SMT model, Tunisian corpus
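
The abstract frames diacritization as translation from undiacritized to diacritized text and reports results as WER and DER. The following is a minimal Python sketch of those two ideas, not the authors' implementation. It assumes that diacritics are the Unicode marks U+064B to U+0652, that WER is the share of words with at least one wrongly restored diacritic, and that DER is the share of letters whose diacritics are wrong; the abstract does not state the exact definitions used in the paper. The function names `strip_diacritics`, `split_letters`, and `error_rates` are hypothetical.

```python
import re
from typing import List, Tuple

# Assumption: diacritics are the Arabic marks fathatan..sukun (U+064B-U+0652).
DIACRITICS = re.compile(r"[\u064B-\u0652]")


def strip_diacritics(text: str) -> str:
    """Derive the source side of a parallel pair from diacritized text."""
    return DIACRITICS.sub("", text)


def split_letters(word: str) -> List[Tuple[str, str]]:
    """Split a word into (base letter, attached diacritics) pairs."""
    units: List[Tuple[str, str]] = []
    for ch in word:
        if DIACRITICS.match(ch):
            if units:  # attach the mark to the preceding base letter
                base, marks = units[-1]
                units[-1] = (base, marks + ch)
            # a stray mark with no base letter before it is ignored
        else:
            units.append((ch, ""))
    return units


def error_rates(reference: str, hypothesis: str) -> Tuple[float, float]:
    """Return (WER, DER) for two diacritized versions of the same text."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words or len(ref_words) != len(hyp_words):
        raise ValueError("texts must be non-empty with the same word count")

    word_errors = letter_errors = letters = 0
    for ref_word, hyp_word in zip(ref_words, hyp_words):
        ref_units, hyp_units = split_letters(ref_word), split_letters(hyp_word)
        if [b for b, _ in ref_units] != [b for b, _ in hyp_units]:
            raise ValueError("base letters differ; only diacritics may change")
        # Compare marks as sorted lists so that mark order does not matter.
        wrong = sum(
            sorted(ref_marks) != sorted(hyp_marks)
            for (_, ref_marks), (_, hyp_marks) in zip(ref_units, hyp_units)
        )
        letters += len(ref_units)
        letter_errors += wrong
        word_errors += 1 if wrong else 0

    return word_errors / len(ref_words), letter_errors / letters
```

Applying `strip_diacritics` to each diacritized corpus sentence would give the undiacritized source side of a parallel corpus, and `error_rates(reference, system_output)` would score a system's output against the reference. Other conventions exist, for example counting only letters that carry a diacritic in the reference, so values from this sketch are not directly comparable to the reported 16.7% and 8.89%.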