Introducing the CURLICAT Corpora: Seven-language Domain Specific Annotated Corpora from Curated Sources.

Tamás Váradi, Bence Nyéki, Svetla Koeva, Marko Tadic, Vanja Stefanec, Maciej Ogrodniczuk, Bartlomiej Niton, Piotr Pezik, Verginica Barbu Mititelu, Elena Irimia, Maria Mitrofan, Dan Tufis, Radovan Garabík, Simon Krek, Andraz Repar

International Conference on Language Resources and Evaluation (LREC), 2022

Abstract

This article presents the current outcomes of the CURLICAT CEF Telecom project, which aims to collect and deeply annotate a set of large corpora from selected domains. The CURLICAT corpus comprises seven monolingual corpora (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) containing selected samples from the respective national corpora. These corpora are automatically tokenized, lemmatized and morphologically analysed, and their named entities are annotated. The annotations are provided uniformly for each language-specific corpus, and the common metadata schema is harmonised across the languages. Additionally, the corpora are annotated for IATE terms in all languages. The files are in the CoNLL-U Plus format, containing the ten columns of the CoNLL-U format and three extra columns specific to our corpora, as defined by Varadi et al. (2020). The CURLICAT corpora represent a rich and valuable resource not just for training NMT models, but also for further studies and developments in machine learning, cross-lingual terminological data extraction and classification.
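
To make the file layout described above more concrete, the following is a minimal sketch of how a CoNLL-U Plus file with thirteen columns could be read. It relies only on the general CoNLL-U Plus convention that the first line declares the column names via `# global.columns = ...`. The three extra column names used here (`EX:NE`, `EX:NP`, `EX:TERM`), the function name `read_conllu_plus` and the two sample tokens are hypothetical placeholders for illustration; the actual CURLICAT column names and contents are those defined by Varadi et al. (2020).

```python
from typing import Dict, Iterable, Iterator, List

Token = Dict[str, str]


def read_conllu_plus(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield sentences as lists of {column name: value} dicts."""
    columns = None
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("# global.columns ="):
            # Column names are declared once, separated by whitespace.
            columns = line.split("=", 1)[1].split()
            continue
        if line.startswith("#"):
            continue  # other comment lines (sentence ids, text, ...)
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if columns is None:
            raise ValueError("missing '# global.columns' header line")
        values = line.split("\t")
        if len(values) != len(columns):
            raise ValueError(f"expected {len(columns)} fields, got {len(values)}")
        sentence.append(dict(zip(columns, values)))
    if sentence:
        yield sentence  # file did not end with a blank line


# Invented sample: ten standard CoNLL-U columns plus three
# hypothetical extra columns (EX:NE, EX:NP, EX:TERM).
SAMPLE = (
    "# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC "
    "EX:NE EX:NP EX:TERM\n"
    "# sent_id = 1\n"
    "1\tEurope\tEurope\tPROPN\t_\t_\t_\t_\t_\t_\tB-LOC\t_\t_\n"
    "2\tgrows\tgrow\tVERB\t_\t_\t_\t_\t_\t_\tO\t_\t_\n"
    "\n"
)

sentences = list(read_conllu_plus(SAMPLE.splitlines()))
for token in sentences[0]:
    print(token["FORM"], token["LEMMA"], token["EX:NE"])
```

Reading the column names from the header rather than hard-coding them keeps the sketch independent of the exact names of the three corpus-specific columns.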

Keywords

national corpora, comparable corpora, domain corpora
