Authors:
(1) Martyna Wiącek, Institute of Computer Science, Polish Academy of Sciences;
(2) Piotr Rybak, Institute of Computer Science, Polish Academy of Sciences;
(3) Łukasz Pszenny, Institute of Computer Science, Polish Academy of Sciences;
(4) Alina Wróblewska, Institute of Computer Science, Polish Academy of Sciences.
Editor's note: This is Part 1 of 10 of a study on improving the evaluation and comparison of tools used in natural language preprocessing.
Abstract and 1. Introduction and related works
With the advancement of transformer-based architectures, we observe the rise of natural language preprocessing (NLPre) tools capable of solving preliminary NLP tasks (e.g. tokenisation, part-of-speech tagging, dependency parsing, or morphological analysis) without any external linguistic guidance. It is arduous to compare these novel solutions with well-entrenched preprocessing toolkits that rely on rule-based morphological analysers or dictionaries. Aware of the shortcomings of existing NLPre evaluation approaches, we investigate a novel method of reliable and fair evaluation and performance reporting. Inspired by the GLUE benchmark, the proposed language-centric benchmarking system enables comprehensive ongoing evaluation of multiple NLPre tools, while credibly tracking their performance. The prototype application is configured for Polish and integrated with the thoroughly assembled NLPre-PL benchmark. Based on this benchmark, we conduct an extensive evaluation of a variety of Polish NLPre systems. To facilitate the construction of benchmarking environments for other languages, e.g. NLPre-GA for Irish or NLPre-ZH for Chinese, we ensure full customisation of the publicly released source code of the benchmarking system. The links to all the resources (deployed platforms, source code, trained models, datasets etc.) can be found on the project website: https://sites.google.com/view/nlpre-benchmark.
Keywords: benchmarking, leaderboard, segmentation, POS tagging, dependency parsing, Polish
Morphosyntactic features predicted by part-of-speech (POS) taggers and dependency parsers underlie various downstream tasks, including but not limited to sentiment analysis (Sun et al., 2019), relation extraction (Zhang et al., 2018; Vashishth et al., 2018; Guo et al., 2019), semantic role labelling (Wang et al., 2019; Kasai et al., 2019), question answering (Khashabi et al., 2018), or machine translation (Chen et al., 2017; Zhang et al., 2019). These underlying tasks may therefore be referred to as natural language preprocessing (NLPre) tasks, as they precede the advanced NLP tasks. Since the quality of morphosyntactic predictions has a crucial impact on the performance of downstream tasks (Sachan et al., 2021), it is prudent to employ the best existing NLPre tools to predict the proper linguistic features. We are equipped with various NLPre methods, ranging from rule-based tools with hand-crafted grammars (e.g. Crouch et al., 2011), through statistical systems (e.g. Nivre, 2009; McDonald et al., 2005; Straka et al., 2016) and neural systems supported by pre-trained language models (e.g. Qi et al., 2020; Nguyen et al., 2021a), to large language models (LLMs; Ouyang et al., 2022).
In the context of intrinsically evaluating NLPre tools and reporting their performance, a variety of approaches have been proposed, e.g. shared tasks, performance tables, and progress repositories. The main goal of a shared task is to comprehensively evaluate participating systems on the released datasets using a carefully defined evaluation methodology. Numerous NLPre shared tasks have been organised so far (e.g. Buchholz and Marsi, 2006; Seddah et al., 2013; Zeman et al., 2017, 2018), and they undoubtedly boosted the development of NLPre. While widely favoured, shared tasks are questionable as a complete and up-to-date source of knowledge about NLPre progress. First, they scrutinise only solutions propounded in the current contest and do not include systems participating in the previous editions or possible future ones. Second, as shared tasks are organised sporadically, their results are not revised and may quickly become outdated. Certainly, the datasets released for shared tasks can be reused in experiments involving novel tools, and the results of such experiments can be reported in independent scientific publications. Nonetheless, these publications are widely scattered, and there is no centralised platform for systematically tracking the ongoing NLPre progress with respect to a particular language.
The results of a new or upgraded NLPre tool are typically reported in performance tables (e.g. Stanza[1] or Trankit[2]). Such tables provide information about the quality of the tool in preprocessing a set of languages. The performance tables, however, often lack comparison with other systems trained for these particular languages. Additionally, as NLPre systems may be trained on different dataset releases (e.g. of Universal Dependencies), comparing their performance tables is not conclusive.
Information about trends and progress in NLP research is usually collected in public repositories such as Papers with Code[3] or NLP-progress[4]. These repositories contain a repertoire of datasets for common NLP tasks, e.g. dependency parsing and POS tagging, and rankings of models trained and tested on these datasets. They are open to contributions of new datasets and results, which, to ensure their credibility, must originate from published and linked scientific papers. However, cutting-edge yet unpublished results of a new or upgraded NLPre system are not eligible for reporting. The datasets accompanying NLPre tasks are mostly in English, raising the problem of language underrepresentation in the repositories. Last but not least, the Papers with Code repository is prone to abuse. After logging in, one can add new results and link them with irrelevant papers as well as edit existing results. The fraudulent results are publicised immediately.
Despite yielding valuable information about the progress in NLPre, the mentioned evaluation approaches also reveal shortcomings, e.g. outdated and incomplete outcomes, lack of cross-system comparison, disregard of some systems, risk of result manipulation and absence of a language-centric perspective.
Following standard procedures in NLP research, we propose to robustly and fairly evaluate NLPre tools using the benchmarking method that allows for the evaluation of NLP models’ performance and progress. NLP benchmarks are coupled with leaderboards that report and update model performance on the benchmark tasks, e.g. GLUE (Wang et al., 2018), XTREME (Hu et al., 2020), GEM (Gehrmann et al., 2021). The conventional benchmarking approach may be dynamically enhanced, as exemplified by the Dynabench platform (Kiela et al., 2021), which enables users to augment the benchmark data by inputting custom examples. This human-and-model-in-the-loop benchmarking scenario appears promising for NLU tasks. Nevertheless, it may not be effective in the case of NLPre, as annotating credible examples of syntactic trees or morphological features requires expert knowledge. Finding multiple experts among casual users can be a serious obstacle; we thus implement our system in tune with the standard benchmarking method.
To our knowledge, benchmarking has not been used to rank NLPre systems, even though it is valuable and desired by the community creating treebanks or designing advanced NLP pipelines. Our NLPre benchmarking approach fills this gap. The proposed online benchmarking system automatically assesses submitted predictions of NLPre systems and publishes their performance ranking on a public scoreboard (see Section 2.2). The system is language-centric and tagset-agnostic, enables comprehensive and credible evaluation, and constitutes an up-to-date source of information on NLPre progress for a particular language. Unlike similar platforms, e.g. Codalab (Pavao et al., 2022), the NLPre benchmarking system is fully configurable and easy to set up, allowing users to establish an evaluation environment for any language. Additionally, it can be self-hosted, making it convenient for developers and researchers working with a particular language to have it accessible on a local server.
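The idea of automatically assessing submitted predictions and ranking them on a scoreboard can be illustrated with a minimal sketch. It is not the released benchmarking system: the function names, the toy data and the choice of plain per-token tag accuracy as the metric are all assumptions made for illustration, and the metrics actually used by the platform are not described in this section.

```python
from typing import Dict, List, Tuple


def tag_accuracy(gold: List[str], predicted: List[str]) -> float:
    """Share of tokens whose predicted tag equals the gold tag.

    Hypothetical metric chosen for illustration only.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted sequences differ in length")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def build_leaderboard(
    gold: List[str], submissions: Dict[str, List[str]]
) -> List[Tuple[str, float]]:
    """Score every submission against the same gold data and sort by score.

    Ties are broken alphabetically by system name so the order is stable.
    """
    scores = [(name, tag_accuracy(gold, tags)) for name, tags in submissions.items()]
    return sorted(scores, key=lambda item: (-item[1], item[0]))


# Toy gold annotation and three hypothetical system outputs.
GOLD = ["NOUN", "VERB", "ADP", "NOUN"]
SUBMISSIONS = {
    "system_a": ["NOUN", "VERB", "ADP", "NOUN"],
    "system_b": ["NOUN", "NOUN", "ADP", "NOUN"],
    "system_c": ["VERB", "NOUN", "ADP", "ADJ"],
}

if __name__ == "__main__":
    for rank, (name, score) in enumerate(build_leaderboard(GOLD, SUBMISSIONS), start=1):
        print(f"{rank}. {name}: {score:.2%}")
```

Because the tags are compared as opaque strings, the sketch makes no assumption about a particular tagset, which loosely mirrors the tagset-agnostic design mentioned above. Scoring all systems against one shared gold file is what makes the resulting ranking comparable.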
To justify the use of the benchmarking technique for NLPre tasks, we conduct empirical research in a challenging scenario with Polish as an example language. In the case of Polish, one dominant hurdle arises: the discrepancies between different tagsets, annotation schemes and datasets utilised for training disparate systems preclude their direct comparison. We thus standardise the training and evaluation of NLPre systems on a new performance benchmark for Polish, hereafter NLPre-PL (see Section 3). It consists of a predefined set of NLPre tasks and reformulated versions of existing Polish datasets. Section 4 outlines our robust and reliable evaluation of the selected NLPre systems on the NLPre-PL benchmark. To our knowledge, no evaluation experiments have been carried out for Polish to compare the performance of off-the-shelf LLMs, neural NLPre systems and established tagging disambiguators, due to the lack of a coherent evaluation environment.
This work makes a tripartite contribution encompassing novelty, research, and development underpinned by an open-source ethos. (1) We propose a novel language-oriented benchmarking approach to evaluate and rank NLPre systems. (2) We conduct a scientific evaluation of the proposed approach in the non-trivial Polish language scenario on the assembled NLPre-PL benchmark. (3) We publish online benchmarking platforms for three distinct languages: Polish[5], Chinese[6], and Irish[7], and release the benchmarking system’s source code as open source.
This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.
[1] https://stanfordnlp.github.io/stanza/performance.html (UD v2.8)
[2] https://trankit.readthedocs.io/en/latest/performance.html#universal-dependencies-v2-5 (UD v2.5)
[3] https://paperswithcode.com
[4] http://nlpprogress.com
[5] https://nlpre-pl.clarin-pl.eu
[6] https://nlpre-zh.clarin-pl.eu
[7] https://nlpre-ga.clarin-pl.eu