A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism

Brian Thompson, Mehak Preet Dhaliwal, Peter Frisch, Tobias Domhan, Marcello Federico

Annual Meeting of the Association for Computational Linguistics (2024)

Abstract
We show that content on the web is often translated into many languages, and the low quality of these multi-way translations indicates they were likely created using Machine Translation (MT). Multi-way parallel, machine generated content not only dominates the translations in lower resource languages; it also constitutes a large fraction of the total web content in those languages. We also find evidence of a selection bias in the type of content which is translated into many languages, consistent with low quality English content being translated en masse into many lower resource languages, via MT. Our work raises serious concerns about training models such as multilingual large language models on both monolingual and bilingual data scraped from the web.