IndoRobusta: Towards Robustness Against Diverse Code-Mixed Indonesian Local Languages.

Muhammad Farid Adilazuarda, Samuel Cahyawijaya, Genta Indra Winata, Pascale Fung, Ayu Purwarianti

CoRR (2023)

Abstract

Significant progress has been made on Indonesian NLP. Nevertheless, exploration of the code-mixing phenomenon in Indonesian is limited, despite many languages being frequently mixed with Indonesian in daily conversation. In this work, we explore code-mixing in Indonesian with four embedded languages, i.e., English, Sundanese, Javanese, and Malay; and introduce IndoRobusta, a framework to evaluate and improve the code-mixing robustness. Our analysis shows that the pre-training corpus bias affects the model's ability to better handle Indonesian-English code-mixing when compared to other local languages, despite having higher language diversity.
