Basque and Spanish Counter Narrative Generation: Data Creation and Evaluation
arXiv (2024)
Abstract
Counter Narratives (CNs) are non-negative textual responses to Hate Speech
(HS) that aim to defuse online hatred and mitigate its spread across
media. Despite the recent increase in HS content posted online, research on
automatic CN generation has been relatively scarce and predominantly focused on
English. In this paper, we present CONAN-EUS, a new Basque and Spanish dataset
for CN generation developed by means of Machine Translation (MT) and
professional post-editing. Because the corpus is parallel, including with respect to the
original English CONAN, it enables novel research on multilingual and
crosslingual automatic generation of CNs. Our experiments on CN generation with
mT5, a multilingual encoder-decoder model, show that generation greatly
benefits from training on post-edited data, as opposed to relying on silver MT
data only. These results are confirmed by their correlation with a qualitative
manual evaluation, demonstrating that manually revised training data remains
crucial for the quality of the generated CNs. Furthermore, multilingual data
augmentation improves results over monolingual settings for structurally
similar languages such as English and Spanish, while being detrimental for
Basque, a language isolate. Similar findings occur in zero-shot crosslingual
evaluations, where model transfer (fine-tuning on English and generating in a
different target language) outperforms fine-tuning mT5 on machine-translated
data for Spanish but not for Basque. This provides an interesting insight into
the asymmetry in the multilinguality of generative models, a challenging topic
which is still open to research.