Authors:
(1) Martyna Wiącek, Institute of Computer Science, Polish Academy of Sciences;
(2) Piotr Rybak, Institute of Computer Science, Polish Academy of Sciences;
(3) Łukasz Pszenny, Institute of Computer Science, Polish Academy of Sciences;
(4) Alina Wróblewska, Institute of Computer Science, Polish Academy of Sciences.
Editor's note: This is Part 3 of 10 of a study on improving the evaluation and comparison of tools used in natural language preprocessing.
2.2. Online benchmarking system
The benchmarking system comprises three main parts: a data repository, a submission and evaluation system, and a leaderboard. The data repository provides descriptions of NLPre tasks, datasets, and evaluation metrics, as well as links to the datasets.
The model submission and evaluation system allows researchers to evaluate a new model by submitting its predictions for the test sets of raw sentences. Uploading predictions for all provided test sets of a given tagset is mandatory; however, a submission may target only one tagset and only a selected range of tasks.
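As a minimal sketch of this completeness rule, the snippet below checks that a submission covers every test set required for the chosen tagset while permitting a subset of tasks. The tagset names, test-set file names, and function name are hypothetical placeholders, not the benchmark's actual identifiers.

```python
# Hypothetical mapping from tagset to its required test sets; the real
# benchmark defines its own datasets and file names.
REQUIRED_TEST_SETS = {
    "tagset_A": ["dataset_1-test.conllu", "dataset_2-test.conllu"],
    "tagset_B": ["dataset_3-test.conllu"],
}


def validate_submission(tagset: str, uploaded_files: list[str]) -> None:
    """Reject a submission that omits any test set required for the tagset."""
    missing = set(REQUIRED_TEST_SETS[tagset]) - set(uploaded_files)
    if missing:
        raise ValueError(f"Missing predictions for test sets: {sorted(missing)}")
```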
The leaderboard is a tabular display of the performance of all submissions, with results for each dataset and tagset. The results and rank of an evaluated model are displayed in the leaderboard provided that the submitter confirms their publication.
The benchmarking system is implemented as a web-based application in Python using the Django framework, which facilitates the implementation of the MVC design pattern and provides an administrator panel that is useful for the custom configuration of the benchmark. The submission scores are stored in a local SQLite database, and the submissions themselves are stored as .zip files in a designated directory. The results from the leaderboard are conveniently accessible via an API.
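The following sketch illustrates how such a setup could look with Django's ORM: submission scores held in the SQLite database, uploaded .zip archives kept in a designated directory, and a publication flag gating what the leaderboard shows. All model and field names are assumptions for illustration, not the benchmark's actual schema.

```python
from django.db import models


class Submission(models.Model):
    """One evaluated model; the uploaded .zip archive is kept on disk."""
    model_name = models.CharField(max_length=200)
    tagset = models.CharField(max_length=50)
    # Shown on the leaderboard only after the submitter confirms publication.
    is_public = models.BooleanField(default=False)
    # Designated directory for the submitted .zip files.
    archive = models.FileField(upload_to="submissions/")
    created_at = models.DateTimeField(auto_now_add=True)


class Score(models.Model):
    """A single metric value for one submission on one dataset and task."""
    submission = models.ForeignKey(
        Submission, on_delete=models.CASCADE, related_name="scores"
    )
    dataset = models.CharField(max_length=100)
    task = models.CharField(max_length=100)
    metric = models.CharField(max_length=50)
    value = models.FloatField()
```

Under this assumed schema, the leaderboard API would simply serialize the public submissions and their `Score` rows as JSON; the actual endpoint paths and response format are defined by the benchmark itself.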
This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.