TeDDi Sample: Text Data Diversity Sample for Language Comparison and Multilingual NLP.

Steven Moran, Christian Bentz, Ximena Gutierrez-Vasques, Olga Pelloni, Tanja Samardzic

International Conference on Language Resources and Evaluation (LREC), 2022

Abstract

We present the TeDDi sample, a diversity sample of text data for language comparison and multilingual Natural Language Processing (NLP). The TeDDi sample currently features 89 languages, based on the typological diversity sample in the World Atlas of Language Structures. It consists of more than 20k texts and is accompanied by open-source corpus processing tools. The aim of TeDDi is to facilitate text-based quantitative analysis of linguistic diversity. We describe the TeDDi sample in detail: how it was created, how the data are made available, and the value it adds for NLP and linguistic research.

Keywords

Corpora, Quantitative Typology, Language Diversity, Language Documentation
