Curated Multilingual Language Resources for CEF AT

The overall objective of the Curated Multilingual Language Resources for CEF AT Action is to compile curated datasets in the seven languages targeted by the consortium (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian), in domains relevant to the European Digital Service Infrastructures (DSIs), with a view to enhancing CEF Automated Translation (CEF AT).
The primary sources of data are the national corpora of these languages. The data will cover domains relevant to some of the CEF DSIs, such as eHealth, Europeana and eGovernment in general. The Action will deliver at least 14 million sentences (estimated to contain at least 140 million words) from domains including culture, education, health and science. Moreover, the Action will address a gap in machine translation technology, which depends crucially on the provision of high-quality, domain-specific language resources for under-resourced languages.
By delivering seven large monolingual datasets, which will facilitate the improvement of the CEF Automated Translation core service platform, the Action will enable international users to access information about the relevant EU Member States, including information about local companies and investment opportunities. The Action will thus also support economic growth in Europe by strengthening the CEF AT core service platform for exchanging information in multiple languages.