Authors:
(1) Martyna Wiącek, Institute of Computer Science, Polish Academy of Sciences;
(2) Piotr Rybak, Institute of Computer Science, Polish Academy of Sciences;
(3) Łukasz Pszenny, Institute of Computer Science, Polish Academy of Sciences;
(4) Alina Wróblewska, Institute of Computer Science, Polish Academy of Sciences.
Editor's note: This is Part 5 of 10 of a study on improving the evaluation and comparison of tools used in natural language preprocessing.
NKJP1M (Przepiórkowski et al., 2018) The NKJP1M subcorpus of the Polish National Corpus (Przepiórkowski et al., 2012) is manually annotated according to the NKJP tagset (Szałkiewicz and Przepiórkowski, 2012) and afterwards modified in line with the Morfeusz tagset (Woliński, 2019). This balanced subset of thematically and genre-diverse texts and transcriptions is used to train Polish POS taggers. NKJP1M is maintained in two formats: TEI[8] and DAG.[9] These two formats are accepted by older NLPre tools but not by modern ones. We thus convert NKJP1M to the CoNLL-X format (Buchholz and Marsi, 2006), preserving the original segmentation, POS tags and morphological features (i.e. the Morfeusz tagset), and to the CoNLL-U format[10] with UD tags, Morfeusz tags (XPOS) and UD morphological features.
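To make the target format concrete, the minimal Python sketch below renders one token in the ten-column CoNLL-U layout. Only the column order follows the UD specification; the helper function, the dict-based token representation and the example token values are our own invented illustration, not the conversion code used in the paper.

```python
# One CoNLL-U token line has ten tab-separated columns (UD specification):
# ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC.
# In the converted NKJP1M, XPOS carries the Morfeusz tag and FEATS the UD features.
COLUMNS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS", "FEATS",
           "HEAD", "DEPREL", "DEPS", "MISC"]

def token_to_conllu(token: dict) -> str:
    """Render a token dict as one CoNLL-U line; absent fields become '_'."""
    return "\t".join(str(token.get(col, "_")) for col in COLUMNS)

# Invented example token (values chosen for illustration only):
print(token_to_conllu({
    "ID": 1, "FORM": "domu", "LEMMA": "dom",
    "UPOS": "NOUN",                  # universal POS tag
    "XPOS": "subst:sg:gen:m3",       # Morfeusz/NKJP-style positional tag
    "FEATS": "Case=Gen|Gender=Masc|Number=Sing",
}))
```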
Since there is no generally accepted split of NKJP1M into training, development and testing subsets, we uniformly divide NKJP1M in all formats (i.e. DAG, TEI, CoNLL-X and CoNLL-U) according to the following splitting heuristics. Each document in the subcorpus contains multiple paragraphs of continuous text. To avoid possible information leakage, we treat each such paragraph as an indivisible unit. To ensure that the subsets include paragraphs of varying length, we investigate the distribution of the number of segments per paragraph. Since it is akin to a Gaussian distribution, we decide not to exclude any data; we divide the paragraphs into K = 10 buckets of roughly similar size and then sample from each bucket with ratios of 0.8:0.1:0.1 (corresponding to the train, dev and test subsets). This data selection technique ensures a similar distribution of the number of segments per paragraph across the three subsets; we call this split byName. To create our second split, hereafter byType, we consider the type of document a paragraph belongs to: we first group paragraphs into categories corresponding to the document types, and then repeat the above procedure per category (see the summary of NKJP1M and the data splits in Table 1). A sketch of this bucketing procedure is given after the PDB-UD description below.

PDB-UD (Wróblewska, 2018) The Polish Dependency Bank is the largest collection of Polish sentences manually annotated with dependency trees and subsequently converted into UD representations in line with the UD annotation schema (de Marneffe et al., 2021). PDB-UD partially overlaps with NKJP1M, i.e., a subset of the PDB-UD sentences comes from NKJP1M, and the language-specific tags (XPOS) in PDB-UD match the Morfeusz tagset. PDB-UD is typically used to train NLPre systems for Polish. In NLPre-PL, we use the original PDB-UD data without any modifications, together with its standard split (see the statistical summary of PDB-UD in Table 1).
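The byName and byType splitting heuristics can be sketched as follows. This is an illustration of the procedure described above under stated assumptions, not the authors' actual code: paragraphs are represented as hypothetical dicts with n_segments and doc_type keys, and the random seed is arbitrary.

```python
import random
from collections import defaultdict

def split_paragraphs(paragraphs, k=10, ratios=(0.8, 0.1, 0.1), seed=0):
    """byName split: cut paragraphs into k quantile buckets by segment
    count, then sample train/dev/test from every bucket so all three
    subsets share a similar distribution of paragraph lengths."""
    rng = random.Random(seed)
    # Order paragraphs by number of segments and cut into k contiguous buckets.
    ordered = sorted(paragraphs, key=lambda p: p["n_segments"])
    size = max(1, len(ordered) // k)
    buckets = [ordered[i:i + size] for i in range(0, len(ordered), size)]

    train, dev, test = [], [], []
    for bucket in buckets:
        rng.shuffle(bucket)
        n_train = int(len(bucket) * ratios[0])
        n_dev = int(len(bucket) * ratios[1])
        train.extend(bucket[:n_train])
        dev.extend(bucket[n_train:n_train + n_dev])
        test.extend(bucket[n_train + n_dev:])
    return train, dev, test

def split_by_type(paragraphs, **kwargs):
    """byType split: repeat the bucketed sampling within each document type."""
    by_type = defaultdict(list)
    for p in paragraphs:
        by_type[p["doc_type"]].append(p)  # "doc_type" is an assumed key
    train, dev, test = [], [], []
    for group in by_type.values():
        tr, de, te = split_paragraphs(group, **kwargs)
        train.extend(tr); dev.extend(de); test.extend(te)
    return train, dev, test
```

Sampling within buckets (rather than over the whole corpus) is what keeps the per-paragraph segment-count distribution similar across train, dev and test.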
This paper is available on arXiv under a CC BY-NC-SA 4.0 DEED license.
[8] http://nlp.ipipan.waw.pl/TEI4NKJP
[9] https://github.com/kawu/concraft-pl#data-format
[10] https://universaldependencies.org/format.html