Authors:
(1) Martyna Wiącek, Institute of Computer Science, Polish Academy of Sciences;
(2) Piotr Rybak, Institute of Computer Science, Polish Academy of Sciences;
(3) Łukasz Pszenny, Institute of Computer Science, Polish Academy of Sciences;
(4) Alina Wróblewska, Institute of Computer Science, Polish Academy of Sciences.
Editor's note: This is Part 4 of 10 of a study on improving the evaluation and comparison of tools used in natural language preprocessing.
2.2. Online benchmarking system
We acknowledge the need to configure similar evaluation environments for other languages, both to promote linguistic diversity within the worldwide NLP community and to support local NLP communities working on a particular language. To facilitate this, we publish a .yaml file that enables easy management of the datasets, tagset, and metrics included in the benchmark. The content of all subpages can be modified with a WYSIWYG editor within the application. This design keeps the barrier to entry low: the platform can be set up with only minimal changes.
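To give a sense of what such a configuration might look like, the sketch below shows a hypothetical .yaml file; the field names and layout are illustrative assumptions only and do not reproduce the actual NLPre schema.

```yaml
# Hypothetical sketch of a benchmark configuration; field names are
# illustrative and do not reflect the actual NLPre schema.
language: pl
tagset: UD                       # tagset used for morphosyntactic annotation
datasets:
  - name: NLPre-PL
    train: data/train.conllu
    dev: data/dev.conllu
    test: data/test.conllu
tasks:
  - segmentation
  - lemmatization
  - pos_tagging
  - dependency_parsing
metrics:
  - UPOS
  - UAS
  - LAS
evaluation_script: scripts/evaluate.py   # may be replaced with custom code
```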
As a standard feature, we include pre-defined descriptions of the prevalent NLPre tasks. These can be modified either via the configuration files or through the administrator panel. Additionally, we supply a default evaluation script, but users are free to provide their own customised code.
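As an illustration of what a customised evaluation script could look like, the following minimal Python sketch computes UPOS and lemma accuracy from two CoNLL-U files. It is a sketch under the assumption that gold and predicted files share the same tokenisation; all names are hypothetical and not taken from the default NLPre script.

```python
"""Minimal sketch of a custom evaluation script (hypothetical, not the
default NLPre script). Assumes gold and predicted CoNLL-U files have
identical tokenisation."""

def read_conllu_tokens(path):
    """Yield (form, upos, lemma) for every token line in a CoNLL-U file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip comments and sentence breaks
            cols = line.split("\t")
            if "-" in cols[0] or "." in cols[0]:
                continue  # skip multiword tokens and empty nodes
            yield cols[1], cols[3], cols[2]  # FORM, UPOS, LEMMA

def accuracy(gold_path, pred_path):
    gold = list(read_conllu_tokens(gold_path))
    pred = list(read_conllu_tokens(pred_path))
    assert len(gold) == len(pred), "tokenisation mismatch"
    upos_hits = sum(g[1] == p[1] for g, p in zip(gold, pred))
    lemma_hits = sum(g[2] == p[2] for g, p in zip(gold, pred))
    return upos_hits / len(gold), lemma_hits / len(gold)

if __name__ == "__main__":
    import sys
    upos_acc, lemma_acc = accuracy(sys.argv[1], sys.argv[2])
    print(f"UPOS accuracy: {upos_acc:.4f}")
    print(f"Lemma accuracy: {lemma_acc:.4f}")
```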
To showcase the capabilities of the benchmarking system, we set up a prototype for Polish (Figure 1). NLPre-PL is described in detail in Section 3. To support our claim that the system is language-agnostic, we also set up NLPre-GA for Irish and NLPre-ZH for Chinese. The choice of these languages is not arbitrary: our objective is to demonstrate the platform's ability to evaluate diverse languages, including those written in non-Latin scripts. In setting up these benchmarking systems, we use the existing UDv2.9 treebanks UD_Chinese-GSD (Shen et al., 2019) and UD_Irish-IDT (Lynn et al., 2015) and the available up-to-date models trained on these treebanks. The selection of models mirrors the criteria applied in this work to the evaluation of Polish, i.e. COMBO, Stanza, spaCy, UDPipe, and Trankit. If a specific model is not available for UDv2.9, we train it from scratch on the aforementioned datasets.
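As an example of how predictions from one of these off-the-shelf models could be produced for such a benchmark, the sketch below runs a pretrained Stanza pipeline for Irish and prints token-level annotations. The example text, language codes, and output layout are illustrative assumptions, not the exact NLPre-GA submission procedure.

```python
# Illustrative sketch: run a pretrained Stanza pipeline for Irish and
# print token-level predictions in CoNLL-U-like columns. The output
# layout is an assumption, not the exact NLPre-GA submission format.
import stanza

stanza.download("ga")          # Irish models; use "zh" for Chinese instead
nlp = stanza.Pipeline("ga")    # default processors: tokenisation through parsing

doc = nlp("Tá an aimsir go maith inniu.")
for sentence in doc.sentences:
    for word in sentence.words:
        # id, form, lemma, UPOS, head, deprel
        print(word.id, word.text, word.lemma, word.upos,
              word.head, word.deprel, sep="\t")
```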
This paper is available on arXiv under a CC BY-NC-SA 4.0 DEED license.