Authors:

(1) Martyna Wiącek, Institute of Computer Science, Polish Academy of Sciences;
(2) Piotr Rybak, Institute of Computer Science, Polish Academy of Sciences;
(3) Łukasz Pszenny, Institute of Computer Science, Polish Academy of Sciences;
(4) Alina Wróblewska, Institute of Computer Science, Polish Academy of Sciences.

Editor's note: This is Part 7 of 10 of a study on improving the evaluation and comparison of tools used in natural language preprocessing. Read the rest below.

Table of Links

Abstract and 1. Introduction and related works
NLPre benchmarking
2.1. Research concept
2.2. Online benchmarking system
2.3. Configuration
NLPre-PL benchmark
3.1. Datasets
3.2. Tasks
Evaluation
4.1. Evaluation methodology
4.2. Evaluated systems
4.3. Results
Conclusions
Appendices
Acknowledgements
Bibliographical References
Language Resource References

4. Evaluation

4.1. Evaluation methodology

To maintain the de facto standard of NLPre evaluation, we apply the evaluation measures defined for the CoNLL 2018 shared task and implemented in the official evaluation script.[11] In particular, we focus on F1 and AlignedAccuracy, which is similar to F1 but does not take into account possible misalignments in tokens, words, or sentences.

In our evaluation process, we follow the default training procedures suggested by the authors of the evaluated systems, i.e. we do not conduct any hyperparameter search and instead leave the recommended model configurations as-is. We also do not further fine-tune the selected models.
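As a minimal sketch of this evaluation setup (not taken from the paper), the snippet below shows how the official CoNLL 2018 evaluation script referenced in footnote [11] can be invoked on a gold treebank and a system's predictions, both in CoNLL-U format. The file names are placeholders, and the command line is assumed to follow the script's documented usage, where the verbose mode prints precision, recall, F1, and aligned accuracy for each metric (tokens, UPOS, UFeats, lemmas, UAS, LAS, etc.).

```python
# Sketch only: run the official CoNLL 2018 evaluation script [11] on a
# gold-standard file and a system-predicted file (both CoNLL-U).
# "gold.conllu" and "system.conllu" are placeholder file names.
import subprocess

result = subprocess.run(
    [
        "python", "conll18_ud_eval.py",  # official evaluation script [11]
        "--verbose",                     # per-metric precision/recall/F1/AlignedAcc
        "gold.conllu",                   # placeholder: gold annotations
        "system.conllu",                 # placeholder: system output
    ],
    capture_output=True,
    text=True,
    check=True,
)

# The verbose report is a plain-text table of scores per metric.
print(result.stdout)
```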
This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.

[11] https://universaldependencies.org/conll18/conll18_ud_eval.py