Project Profile

Project abbreviation: TurkuParaC

Project name: Turku Paraphrase Corpus

Project coordinator: Filip Ginter

Project consortium: Bruno Kessler Foundation (FBK)

Funding: ELG, Academy of Finland

Project duration: ELG: 08/2020-07/2021; Academy of Finland: 09/2020-08/2023

Main keywords: paraphrase, Finnish, Swedish, deep learning, corpus creation, natural language understanding

Background of the research topic: Paraphrase is an important target for natural language understanding, as it requires models to recognize shared meaning even when the wording differs substantially.

Goal of the project: The goal is to create a large-scale manually annotated paraphrase corpus suitable for paraphrase model training and evaluation, as well as to develop machine learning models for paraphrase detection.

Project abstract: The project builds a large dataset of Finnish paraphrase pairs, accompanied by a small test set of Swedish paraphrases. The paraphrases are selected and classified manually so as to minimize lexical overlap and to provide examples that differ as much as possible in structure and vocabulary. The objective is to create a challenging dataset that better tests the capabilities of natural language understanding. An important feature of the data is that most paraphrase pairs are distributed together with their document context. The primary application for the dataset is the development and evaluation of deep language models, and representation learning in general.
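
Illustration (not part of the project description): one simple way to quantify the "lexical overlap" mentioned in the abstract is the Jaccard similarity of the two token sets in a pair. The sketch below is an assumption made for illustration only. The function names, the whitespace tokenization, and the English example sentences are all hypothetical, and the project itself selects and classifies paraphrases manually rather than with a metric like this.

```python
import string


def tokens(text: str) -> set[str]:
    """Lowercase, split on whitespace, and strip punctuation from token edges."""
    stripped = (tok.strip(string.punctuation) for tok in text.lower().split())
    return {tok for tok in stripped if tok}


def lexical_overlap(a: str, b: str) -> float:
    """Jaccard similarity of the two token sets: |A & B| / |A | B|."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0  # two empty texts: define the overlap as zero
    return len(ta & tb) / len(ta | tb)


# Hypothetical example pairs (invented, not taken from the corpus).
pairs = [
    ("The meeting was postponed until Friday.",
     "They pushed the meeting back to Friday."),
    ("I can't make it tonight.",
     "Something came up, so count me out this evening."),
]

for first, second in pairs:
    print(f"{lexical_overlap(first, second):.2f}  {first!r} / {second!r}")
```

Under this measure, the second pair shares no tokens at all while still expressing roughly the same meaning, which is the kind of example the abstract describes as challenging. A real analysis of Finnish or Swedish text would likely need proper tokenization and lemmatization, which this sketch leaves out.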


