Towards the Primary Platform for Language Technologies in Europe

Textual paraphrase dataset for deep language modeling

Short Name: PARA4DLM
Name: Textual paraphrase dataset for deep language modeling
Coordinator: Filip Ginter, University of Turku
Project Runtime: August 2020 – July 2021
Funded by: European Language Grid

The project gathers a large dataset of Finnish and Swedish paraphrases. The paraphrases are selected and classified manually to minimize lexical overlap, so that the paired examples differ as much as possible in both structure and vocabulary. The dataset is intended primarily for developing and evaluating deep language models and, more generally, for representation learning.