Persian offensive language detection

Emad Kebriaei, Ali Homayouni, Roghayeh Faraji, Armita Razavi, Azadeh Shakery, Heshaam Faili, Yadollah Yaghoobzadeh

Machine Learning (2023)

Abstract
With the proliferation of social networks and their impact on human life, one of the growing problems in this environment is the rise of verbal and written insults and hatred. As one of the major platforms for distributing text-based content, Twitter frequently hosts abusive remarks from its users. The first step in recognizing offensive phrases is building a comprehensive collection of offensive sentences on which a model can be trained. Moreover, despite the abundance of resources in English and other languages, there are few resources and studies on identifying hateful and offensive statements in Persian. In this study, we compiled a 38K-tweet dataset of Persian hate and offensive language using keyword-based data selection strategies. A Persian offensive lexicon and nine hatred target-group lexicons were gathered through crowdsourcing for this purpose. The dataset was annotated manually, with each tweet examined by at least two annotators. In addition, to analyze the effect of the lexicons on language model behavior, we employed two assessment criteria (FPED and pAUCED) to measure the dataset's potential bias. Then, by configuring the dataset based on the results of the bias measurement, we mitigated the effect of word-level bias in tweets on language model performance. The results indicate that bias is significantly diminished while the F1 score drops by less than one hundredth.
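The abstract names FPED as one of its two bias measures. As a minimal sketch of how such a metric is typically computed (following the common definition of False Positive Equality Difference as the sum of per-group deviations from the overall false positive rate), the snippet below is illustrative only: the group names and toy data are hypothetical and not taken from the paper's dataset.

```python
# Sketch of the FPED bias metric: sum over target groups of
# |FPR_overall - FPR_group|. Lower values mean the classifier's
# false-positive behavior is more uniform across groups.
from typing import Dict, List, Tuple

def false_positive_rate(labels: List[int], preds: List[int]) -> float:
    """FPR = FP / (FP + TN) for binary labels (1 = offensive)."""
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    return fp / (fp + tn) if (fp + tn) > 0 else 0.0

def fped(overall: Tuple[List[int], List[int]],
         per_group: Dict[str, Tuple[List[int], List[int]]]) -> float:
    """Sum of |FPR_overall - FPR_group| over all target-group subsets."""
    fpr_all = false_positive_rate(*overall)
    return sum(abs(fpr_all - false_positive_rate(*pair))
               for pair in per_group.values())

# Hypothetical usage with tiny toy data for two target groups:
overall = ([0, 0, 1, 0, 1, 0], [0, 1, 1, 0, 1, 0])
groups = {
    "group_a": ([0, 0, 1], [0, 1, 1]),
    "group_b": ([0, 1, 0], [0, 1, 0]),
}
print(f"FPED = {fped(overall, groups):.3f}")
```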
Keywords
Offensive language detection, Debiasing, Imbalanced data, Twitter