Use of Machine Translation to Obtain Labeled Datasets for Resource-Constrained Languages

Budur Emrah, Özçelik Rıza, Güngör Tunga, Potts Christopher

arXiv (2020)

Abstract

The large annotated datasets in NLP are overwhelmingly in English. This is an obstacle to progress for other languages. Unfortunately, obtaining new annotated resources for each task in each language would be prohibitively expensive. At the same time, commercial machine translation systems are now robust. Can we leverage these systems to translate English-language datasets automatically? In this paper, we offer a positive response to this for natural language inference (NLI) in Turkish. We translated two large English NLI datasets into Turkish and had a team of experts validate their quality. As examples of the new issues that these datasets help us address, we assess the value of Turkish-specific embeddings and the importance of morphological parsing for developing robust Turkish NLI models.

Keywords

machine translation, obtain labeled datasets, languages, resource-constrained
