Towards the Primary Platform for
Language Technologies in Europe

Curated Multilingual Language Resources for CEF AT

Short Name: CURLICAT
Name: Curated Multilingual Language Resources for CEF AT
Coordinator: Tamás Váradi, Research Institute for Linguistics
Consortium: Research Institute for Linguistics, Institute for Bulgarian Language “Prof. Lyubomir Andreychin”, University of Zagreb, Polish Academy of Sciences, Academia Romana, Jazykovedný ústav Ľ. Štúra Slovenskej akadémie vied, “Jožef Stefan” Institute
Project Runtime: 1 June 2020 – 31 May 2022
Funded by: European Commission
The overall objective of the Curated Multilingual Language Resources for CEF AT Action is to compile curated datasets in the seven languages targeted by the consortium (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) in domains relevant to the European Digital Service Infrastructures (DSIs), with a view to enhancing CEF Automated Translation (CEF AT).
The primary sources of data are the national corpora of the above-mentioned languages. The data will cover domains relevant to some of the CEF DSIs, such as eHealth, Europeana and eGovernment in general. The Action will deliver at least 14 million sentences (estimated to contain at least 140 million words) from domains including culture, education, health and science. Moreover, the Action will address a gap in machine translation technology, which crucially depends on the provision of domain-specific, high-quality language resources for under-resourced languages.
By delivering seven large monolingual datasets, which will themselves facilitate the improvement of the CEF Automated Translation core service platform, the Action will enable international users to access information about the relevant EU Member States, including information about local companies and investment opportunities. The Action will thus also support economic growth in Europe by strengthening the CEF AT core service platform for exchanging information in multiple languages.