Researchers Challenge AI to Tackle the Toughest Parts of Language Processing

by Morphology, December 30th, 2024

Too Long; Didn't Read

Researchers in Poland have developed an open-source tool that improves the evaluation and comparison of AI used in natural language preprocessing.

Authors:

(1) Martyna Wiącek, Institute of Computer Science, Polish Academy of Sciences;

(2) Piotr Rybak, Institute of Computer Science, Polish Academy of Sciences;

(3) Łukasz Pszenny, Institute of Computer Science, Polish Academy of Sciences;

(4) Alina Wróblewska, Institute of Computer Science, Polish Academy of Sciences.

Editor's note: This is Part 6 of 10 of a study on improving the evaluation and comparison of tools used in natural language preprocessing. Read the rest below.

Abstract and 1. Introduction and related works

2. NLPre benchmarking

2.1. Research concept

2.2. Online benchmarking system

2.3. Configuration

3. NLPre-PL benchmark

3.1. Datasets

3.2. Tasks

4. Evaluation

4.1. Evaluation methodology

4.2. Evaluated systems

4.3. Results

5. Conclusions
    • Appendices
    • Acknowledgements
    • Bibliographical References
    • Language Resource References

3.2. Tasks

The complete set of NLPre tasks was originally curated for evaluating language systems in the CoNLL 2018 shared task (Zeman et al., 2018). These tasks mainly focus on preliminary text processing, such as tokenisation or predicting morphosyntactic features. We follow the CoNLL task selection and include all these tasks in NLPre-PL.


Segmentation. A segmentation task consists of splitting texts into sentences (Sentences), orthographic tokens (Tokens), and syntactic words (Words), the last being the basic units of morphosyntactic analysis. Segmentation is not a trivial task. In some languages, an orthographic token may be a multi-word token (multiword for short) that combines several syntactic words. In Polish, for example, the token spalibyśmy (Eng. we would sleep) consists of the past participle spali (Eng. slept), the conditional marker by (Eng. would) and the mobile inflection śmy. Since a consistent model of segmentation into words and sentences was used in NKJP1M and PDB-UD, we maintain this segmentation in NLPre-PL. The CoNLL format (but not TEI and DAG) also allows for annotating orthographic tokens; thus, they are included in the NLPre-PL benchmark.
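
The relation between orthographic tokens and syntactic words can be sketched in a few lines of Python. This is only an illustration, not the benchmark's own code: the `Token` class and the `conllu_rows` helper are hypothetical names, and the one-token "sentence" with a full stop is made up for the example. The ID range row follows the CoNLL-U convention for multiword tokens.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Token:
    """An orthographic token and the syntactic words it is split into."""
    form: str
    # Empty list means the token is itself a single syntactic word.
    words: List[str] = field(default_factory=list)

    def syntactic_words(self) -> List[str]:
        return self.words or [self.form]


def conllu_rows(sentence: List[Token]) -> List[Tuple[str, str]]:
    """Return (ID, FORM) pairs; a multiword token gets an extra range row."""
    rows = []
    next_id = 1
    for token in sentence:
        words = token.syntactic_words()
        if len(words) > 1:
            # Range ID such as "1-3" covers the words of a multiword token.
            rows.append((f"{next_id}-{next_id + len(words) - 1}", token.form))
        for word in words:
            rows.append((str(next_id), word))
            next_id += 1
    return rows


# Illustrative input: the multiword token from the text plus a full stop.
sentence = [Token("spalibyśmy", ["spali", "by", "śmy"]), Token(".")]
for row_id, form in conllu_rows(sentence):
    print(row_id, form)
```

The range row carries the orthographic token, while the numbered rows carry the syntactic words, so both segmentation levels can be scored from the same structure.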


Tagging. Tagging is the process of identifying the parts of speech (POS tagging) and, optionally, the morphological features (morphological analysis) of words. It follows a predefined POS tagset. As mentioned in Section 3.1, two tagsets are used in the NLPre-PL datasets: Morfeusz and UD.
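
As a rough illustration of what a tagger's output looks like under the two tagsets, the sketch below parses a UD-style feature string and a colon-separated positional tag. The function names are hypothetical, and the example tag values are assumptions chosen for illustration rather than taken from the NLPre-PL data.

```python
from typing import Dict, List, Tuple


def parse_ud_feats(feats: str) -> Dict[str, str]:
    """Parse a CoNLL-U FEATS string such as 'Case=Nom|Number=Sing'."""
    if feats == "_":  # CoNLL-U writes '_' for an empty feature set
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def split_positional_tag(tag: str) -> Tuple[str, List[str]]:
    """Split a colon-separated tag into its POS and its feature values."""
    pos, *values = tag.split(":")
    return pos, values


# Assumed example tags for a singular nominative noun.
print(parse_ud_feats("Case=Nom|Gender=Masc|Number=Sing"))
print(split_positional_tag("subst:sg:nom:m3"))
```

In the UD style the part of speech is stored separately from named features, whereas in a positional tag the POS and the feature values share one string, so the two outputs need different parsing before they can be compared.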


Lemmatisation. Lemmatisation involves predicting the canonical forms of syntactic words. Canonical forms are conventionally established identifiers of lexemes (i.e. sets of inflectionally related syntactic words). Since Polish is a fusional language with a large number of inflected words, lemmatisation is an important but non-trivial task. For example, the lemma of kluczy can be either the infinitive kluczyć (Eng. to weave) or the noun klucz (Eng. a key).
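
The kluczy ambiguity can be made concrete with a toy lookup that picks a lemma only once the part of speech is known. This is a minimal sketch under stated assumptions: the `CANDIDATES` table and the `lemmatise` function are hypothetical, the POS labels are UD coarse tags, and real lemmatisers rely on sentence context rather than a hand-written dictionary.

```python
from typing import Dict, List, Tuple

# Hypothetical candidate table: surface form -> (lemma, coarse POS) pairs.
CANDIDATES: Dict[str, List[Tuple[str, str]]] = {
    "kluczy": [("kluczyć", "VERB"), ("klucz", "NOUN")],
}


def lemmatise(form: str, upos: str) -> str:
    """Return the candidate lemma matching the POS, or the form itself."""
    matches = [lemma for lemma, pos in CANDIDATES.get(form, []) if pos == upos]
    return matches[0] if matches else form


print(lemmatise("kluczy", "VERB"))
print(lemmatise("kluczy", "NOUN"))
```

The same surface form yields two different lemmas depending on the tag, which is why lemmatisation quality is tied to tagging quality.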


Dependency parsing. Dependency parsing is the process of automatically predicting the syntactic structure of an input sentence. A dependency structure is a labelled directed tree with nodes corresponding to syntactic words and edges between these words specifying dependency relations.
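
The tree property described above can be checked mechanically. The sketch below assumes the CoNLL-U convention in which each word stores the 1-based index of its head and 0 marks the root; the function names are hypothetical, the relation labels are UD relation names used for illustration, and the single-root requirement is an assumption of this sketch.

```python
from typing import List, Tuple


def is_dependency_tree(heads: List[int]) -> bool:
    """heads[i] is the 1-based head of word i + 1; 0 marks the root."""
    n = len(heads)
    if n == 0 or any(h < 0 or h > n for h in heads):
        return False
    if sum(1 for h in heads if h == 0) != 1:  # exactly one root
        return False
    for start in range(1, n + 1):
        seen = set()
        node = start
        while node != 0:  # walk up towards the root
            if node in seen:  # revisiting a node means a cycle
                return False
            seen.add(node)
            node = heads[node - 1]
    return True


def labelled_edges(heads: List[int], labels: List[str]) -> List[Tuple[int, int, str]]:
    """Return (head, dependent, relation) triples for every word."""
    return [(h, i + 1, lab) for i, (h, lab) in enumerate(zip(heads, labels))]


heads = [2, 0, 2]  # word 2 is the root; words 1 and 3 depend on it
labels = ["nsubj", "root", "obj"]
print(is_dependency_tree(heads))
print(labelled_edges(heads, labels))
```

Storing one head index and one label per word guarantees that every word has exactly one incoming edge, so only the single-root and no-cycle conditions need to be verified.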


This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.