A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism
Annual Meeting of the Association for Computational Linguistics (2024)
Abstract
We show that content on the web is often translated into many languages, and the low quality of these multi-way translations indicates they were likely created using Machine Translation (MT). Multi-way parallel, machine generated content not only dominates the translations in lower resource languages; it also constitutes a large fraction of the total web content in those languages. We also find evidence of a selection bias in the type of content which is translated into many languages, consistent with low quality English content being translated en masse into many lower resource languages, via MT. Our work raises serious concerns about training models such as multilingual large language models on both monolingual and bilingual data scraped from the web.