New Framework Promises to Train AI to Better Understand Hard-to-Grasp Languages Like Polish

by Morphology
December 30th, 2024
Too Long; Didn't Read

Researchers in Poland have developed an open-source tool that improves the evaluation and comparison of AI used in natural language preprocessing.

STORY'S CREDIBILITY

Academic Research Paper

Part of HackerNoon's growing list of open-source research papers, promoting free access to academic material.

Authors:

(1) Martyna Wiącek, Institute of Computer Science, Polish Academy of Sciences;

(2) Piotr Rybak, Institute of Computer Science, Polish Academy of Sciences;

(3) Łukasz Pszenny, Institute of Computer Science, Polish Academy of Sciences;

(4) Alina Wróblewska, Institute of Computer Science, Polish Academy of Sciences.

Editor's note: This is Part 5 of 10 of a study on improving the evaluation and comparison of tools used in natural language preprocessing. Read the rest below.

Abstract and 1. Introduction and related works

2. NLPre benchmarking

2.1. Research concept

2.2. Online benchmarking system

2.3. Configuration

3. NLPre-PL benchmark

3.1. Datasets

3.2. Tasks

4. Evaluation

4.1. Evaluation methodology

4.2. Evaluated systems

4.3. Results

5. Conclusions

• Appendices

• Acknowledgements

• Bibliographical References

• Language Resource References

3. NLPre-PL benchmark

3.1. Datasets

Table 1: Summary of source datasets (NKJP1M and PDB-UD) and NLPre-PL datasets (in tokens). Explanations: POS – the part-of-speech tagset; DEP – the dependency schema; Avg. t/s – the average number of tokens per sentence.


NKJP1M (Przepiórkowski et al., 2018) The NKJP1M subcorpus of the Polish National Corpus (Przepiórkowski et al., 2012) is manually annotated according to the NKJP tagset (Szałkiewicz and Przepiórkowski, 2012) and afterwards modified in line with the Morfeusz tagset (Woliński, 2019). This balanced subset of thematically and genre-diverse texts and transcriptions is used to train Polish POS taggers. NKJP1M is maintained in two formats: TEI[8] and DAG.[9] These two formats are accepted by older NLPre tools but not by modern ones. We thus convert NKJP1M to the CoNLL-X format (Buchholz and Marsi, 2006), preserving the original segmentation, POS tags and morphological features (i.e. the Morfeusz tagset), and to the CoNLL-U format[10] with UD tags, Morfeusz tags (XPOS) and UD morphological features.
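For orientation, a CoNLL-U file stores one token per line with ten tab-separated columns, uses `#` lines for sentence-level metadata, and separates sentences with blank lines. The minimal reader below is an illustrative sketch, not the converter used by the authors; the Polish sample sentence and its UPOS/XPOS annotations are likewise only an approximation of NKJP1M-style tags.

```python
def read_conllu(lines):
    """Parse CoNLL-U lines into a list of sentences, each a list of
    token dicts keyed by the ten standard CoNLL-U columns."""
    fields = ["id", "form", "lemma", "upos", "xpos",
              "feats", "head", "deprel", "deps", "misc"]
    sentences, tokens = [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):        # sentence-level metadata
            continue
        if not line:                    # blank line ends a sentence
            if tokens:
                sentences.append(tokens)
                tokens = []
            continue
        tokens.append(dict(zip(fields, line.split("\t"))))
    if tokens:                          # file may lack a final blank line
        sentences.append(tokens)
    return sentences


# Illustrative sample: "Ala ma kota." with approximate UD/Morfeusz-style tags.
sample = """\
# text = Ala ma kota.
1\tAla\tAla\tPROPN\tsubst:sg:nom:f\t_\t2\tnsubj\t_\t_
2\tma\tmie\u0107\tVERB\tfin:sg:ter:imperf\t_\t0\troot\t_\t_
3\tkota\tkot\tNOUN\tsubst:sg:acc:m2\t_\t2\tobj\t_\t_
4\t.\t.\tPUNCT\tinterp\t_\t2\tpunct\t_\tSpaceAfter=No
"""

parsed = read_conllu(sample.splitlines())
```

The key point is that both the XPOS column (Morfeusz tags) and the UPOS/FEATS columns (UD tags and features) coexist in one file, which is what lets a single CoNLL-U conversion serve tools trained on either tagset.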


Since there is no generally accepted split of NKJP1M into training, development and testing subsets, we uniformly divide NKJP1M in all formats (i.e. DAG, TEI, CoNLL-X and CoNLL-U) according to the following splitting heuristics. Each document in the subcorpus contains multiple paragraphs of continuous text. To avoid possible information leakage, we treat each such paragraph as an indivisible unit. To ensure that the subsets include paragraphs of varying length, we examine the distribution of the number of segments per paragraph. Since it is close to a Gaussian distribution, we decide not to exclude any data; we divide the paragraphs into K = 10 buckets of roughly equal size and then sample from each bucket with ratios of 0.8:0.1:0.1 (corresponding to the train, dev and test subsets). This selection technique ensures a similar distribution of segment counts per paragraph across the three subsets; we refer to this split as byName. For our second split, hereafter byType, we consider the type of document each paragraph belongs to: we first group paragraphs into categories corresponding to the document types, and then repeat the above procedure per category (see the summary of NKJP1M and the data splits in Table 1).

PDB-UD (Wróblewska, 2018) The Polish Dependency Bank is the largest collection of Polish sentences manually annotated with dependency trees and subsequently converted into UD representations in line with the UD annotation schema (de Marneffe et al., 2021). PDB-UD slightly overlaps with NKJP1M: a subset of the PDB-UD sentences comes from NKJP1M, and the language-specific tags (XPOS) in PDB-UD match the Morfeusz tagset. PDB-UD is typically used to train NLPre systems for Polish. In NLPre-PL, we use the original PDB-UD data without any modifications, along with its standard split (see the statistical summary of PDB-UD in Table 1).
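The bucketed splitting heuristic described above for NKJP1M can be sketched in a few lines. This is an assumed reconstruction, not the authors' code: the function name, the `(paragraph_id, n_segments)` input representation, the fixed seed, and the round-off handling at bucket boundaries are all choices of this sketch.

```python
import math
import random


def bucketed_split(paragraphs, k=10, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split paragraphs into train/dev/test subsets, stratified by length.

    `paragraphs` is a list of (paragraph_id, n_segments) pairs; every
    paragraph is kept whole so no fragment leaks across subsets.
    """
    rng = random.Random(seed)
    # Order by segment count and cut into k contiguous buckets of
    # roughly equal size, so each bucket spans a narrow length band.
    ordered = sorted(paragraphs, key=lambda p: p[1])
    size = math.ceil(len(ordered) / k)
    buckets = [ordered[i * size:(i + 1) * size] for i in range(k)]

    train, dev, test = [], [], []
    for bucket in buckets:
        rng.shuffle(bucket)
        n_train = round(ratios[0] * len(bucket))
        n_dev = round(ratios[1] * len(bucket))
        train += bucket[:n_train]
        dev += bucket[n_train:n_train + n_dev]
        test += bucket[n_train + n_dev:]
    return train, dev, test
```

Because the 0.8:0.1:0.1 cut is applied within each length bucket rather than over the whole corpus, each subset inherits roughly the same distribution of paragraph lengths, which is the stated goal of the byName split; running the same routine once per document-type category yields the byType split.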


This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.


[8] http://nlp.ipipan.waw.pl/TEI4NKJP.


[9] https://github.com/kawu/concraft-pl#data-format


[10] https://universaldependencies.org/format.html
