CURLICAT

The fourth annual ELG conference

JOINING THE EUROPEAN LANGUAGE GRID:

Together Towards Digital Language Equality

8/9 June 2022
Brussels, Belgium
Hybrid conference

Project Expo

Project Profile

Project abbreviation: CURLICAT

Project name: Curated Multilingual Language Resources for CEF AT

Project coordinator: Tamás Váradi, Research Institute for Linguistics, Budapest, Hungary

Project consortium:

Research Institute for Linguistics (RIL), Budapest, Hungary
Institute for Bulgarian Language "Prof. Lyubomir Andreychin" (IBL), Sofia, Bulgaria
University of Zagreb, Faculty of Humanities and Social Sciences (FFZG), Zagreb, Croatia
Institute of Computer Science, Polish Academy of Sciences (IPI-PAN), Warsaw, Poland
Institutul de Cercetari pentru Inteligenta Artificiala, Academia Romana (RACAI), Bucharest, Romania
Jazykovedný ústav Ľ. Štúra Slovenskej akadémie vied (LSIL), Bratislava, Slovakia
"Jožef Stefan" Institute (JSI), Ljubljana, Slovenia

Funding: CEF Telecom Programme

Project duration: 2020-06-01 – 2022-11-30

Main key words: Language Resources, National Corpora, Machine Translation

Background of the research topic: Large language resource (LR) collection campaigns such as ELRC and ELG did not have access to the data held in the national/reference corpora of the consortium's seven languages. This Action not only makes these data available but also carefully curates them: they are processed, annotated and enriched with metadata.

Goal of the project: Facilitate the development of NMT systems by providing seven monolingual comparable corpora in four domains relevant to the application of CEF.AT in different DSIs.

Project abstract: The overall objective of the Curated Multilingual Language Resources for CEF AT Action is to compile curated datasets in the seven languages targeted by the consortium (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) in domains relevant to the European Digital Service Infrastructures (DSIs), with a view to enhancing Automated Translation. The prime source of data is the national corpora of these languages. The data will cover domains relevant to some of the CEF DSIs, such as eHealth, Europeana and eGovernment in general. The Action will deliver at least 14 million sentences (estimated to contain at least 140 million words) from domains including culture, science, health and economy/finance. Moreover, the Action will address the gap in machine translation technology, which crucially depends on the provision of domain-specific quality language resources for these moderately resourced languages. By delivering seven large monolingual datasets, which will themselves facilitate the improvement of the CEF Automated Translation core service platform, the Action will enable international users to access information about the relevant EU Member States, including information about local companies and investment opportunities. The Action will thus also support economic growth in Europe by supporting the CEF AT core service platform for exchanging information in multiple languages.

Publications:

Tamás Váradi, Svetla Koeva, Martin Yalamov, Marko Tadić, Bálint Sass, Bartłomiej Nitoń, Maciej Ogrodniczuk, Piotr Pęzik, Verginica Barbu Mititelu, Radu Ion, Elena Irimia, Maria Mitrofan, Vasile Păiș, Dan Tufiș, Radovan Garabík, Simon Krek, Andraž Repar, Matjaž Rihtar and Janez Brank (2020) The MARCELL Legislative Corpus, Proceedings of the 12th LREC, pp. 3761‑3768.
Svetla Koeva, Nikola Obreshkov and Martin Yalamov (2020) Natural Language Processing Pipeline to Annotate Bulgarian Legislative Documents, Proceedings of the 12th LREC, pp. 6988‑6994.
Dan Tufiș, Maria Mitrofan, Vasile Păiș, Radu Ion and Andrei Coman (2020) Collection and Annotation of the Romanian Legal Corpus, Proceedings of the 12th LREC, pp. 2773‑2777.
Tamás Váradi, Bence Nyéki, Svetla Koeva, Marko Tadić, Vanja Štefanec, Maciej Ogrodniczuk, Bartłomiej Nitoń, Piotr Pęzik, Verginica Barbu Mititelu, Elena Irimia, Maria Mitrofan, Vasile Păiș, Dan Tufiș, Radovan Garabík, Simon Krek and Andraž Repar (2022) Introducing the CURLICAT Corpora: Seven-language Domain Specific Annotated Corpora from Curated Sources, Proceedings of the 13th LREC, (in press).
Tamás Váradi, Marko Tadić, Svetla Koeva, Maciej Ogrodniczuk, Dan Tufiș, Radovan Garabík, Simon Krek, Andraž Repar (2022) Curated Multilingual Language Resources for CEF AT (CURLICAT): Overall View, Proceedings of the 23rd EAMT, Ghent, 2022, pp. 339-340.
Tamás Váradi, Marko Tadić, Svetla Koeva, Maciej Ogrodniczuk, Dan Tufiș, Radovan Garabík, Simon Krek (2022) CURLICAT: Curated Multilingual Language Resources for CEF AT, Proceedings of the 1st NeTTT2022 conference, (in press).