
New Open-Source Platform Is Letting AI Researchers Crack Tough Languages

by Morphology, December 30th, 2024

Too Long; Didn't Read

Researchers in Poland have developed an open-source tool that improves the evaluation and comparison of AI used in natural language preprocessing.

Authors:

(1) Martyna Wiącek, Institute of Computer Science, Polish Academy of Sciences;

(2) Piotr Rybak, Institute of Computer Science, Polish Academy of Sciences;

(3) Łukasz Pszenny, Institute of Computer Science, Polish Academy of Sciences;

(4) Alina Wróblewska, Institute of Computer Science, Polish Academy of Sciences.

Editor's note: This is Part 10 of 10 of a study on improving the evaluation and comparison of tools used in natural language preprocessing. Read the rest below.

Abstract and 1. Introduction and related works

2. NLPre benchmarking

2.1. Research concept

2.2. Online benchmarking system

2.3. Configuration

3. NLPre-PL benchmark

3.1. Datasets

3.2. Tasks

4. Evaluation

4.1. Evaluation methodology

4.2. Evaluated systems

4.3. Results

5. Conclusions

  • Appendices
  • Acknowledgements
  • Bibliographical References
  • Language Resource References

5. Conclusions

In this work, we propose a revised approach to NLPre evaluation via benchmarking. This is motivated by the widespread use of benchmarking in other NLP fields, as well as by the shortcomings of existing NLPre evaluation solutions.


We implement this NLPre benchmarking approach as an online system that evaluates the submitted output of an NLPre system and, after the submitter’s approval, updates the associated leaderboard with the results. The benchmarking system is designed to rank the NLPre tools available for a given language in a trustworthy environment.


Defining and enhancing the system’s capabilities proceeds in parallel with the creation of the NLPre benchmark for Polish, which accounts for factors such as tasks not required in English and diverse tagsets. The NLPre-PL benchmark consists of the predefined NLPre tasks coupled with two reformulated datasets. It therefore sets the standard for evaluating the performance of NLPre tools for Polish, which represents a derivative yet important outcome of our research.


In addition to being integrated into the benchmarking system, NLPre-PL is used to conduct empirical experiments. We perform a robust and extensive comparison of different NLPre methods, including classical non-neural tools and modern neural network-based techniques. The results of these experiments on datasets in two tagsets are discussed in detail and confirm our assumption that modern architectures obtain better results. Since NLP is a discipline undergoing rapid progress, new NLPre solutions, e.g. multilingual or zero-shot ones, can be expected in the coming years. These new solutions can be easily tested and compared with the tools evaluated so far in our benchmarking system.


Finally, we release the open-source code of the benchmarking system in the hope that this endeavour can be replicated for other languages. To expedite this process, we ensure that the system is fully configurable as well as language- and tagset-agnostic. The NLPre system, configured for a specified language, can be self-hosted on a chosen server, and the results from the leaderboard are conveniently accessible via an API. We see a potential future application of our system to the UD repository, which currently hosts 245 treebanks for 141 languages, with presumably discrepant versions of the UD tagset.
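The paper does not specify the leaderboard API itself, so the endpoint path, query parameter, and response fields in the sketch below are purely illustrative assumptions; only the general idea of fetching results from a self-hosted instance over an API comes from the text.

```python
# Minimal sketch of reading leaderboard results from a hypothetical
# self-hosted NLPre benchmarking instance. The base URL, endpoint, and
# response schema are assumptions, not the system's documented interface.
import json
from urllib.request import urlopen

BASE_URL = "https://nlpre.example.org"  # hypothetical self-hosted instance


def fetch_leaderboard(task: str) -> list:
    """Download leaderboard entries for one task as a list of dicts."""
    with urlopen(f"{BASE_URL}/api/leaderboard?task={task}") as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Field names below ("system", "LAS") are assumed for illustration only.
    entries = fetch_leaderboard("dependency_parsing")
    for entry in sorted(entries, key=lambda e: e.get("LAS", 0.0), reverse=True):
        print(f'{entry.get("system", "?"):30s}  LAS={entry.get("LAS", 0.0):.2f}')
```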

6. Appendices

6.1. Infrastructure used

We train the models using several types of computational nodes at our disposal, including NVIDIA V100 32GB, NVIDIA GeForce RTX 2080 8GB, and NVIDIA GeForce RTX 3070 8GB GPUs, as well as an Intel Xeon E5-2697 processor. Since we do not perform hyperparameter tuning, this should not impact our results.

6.2. Further results of experiments

Herein, we present a comprehensive depiction of our experimental findings as they are displayed on the NLPre-PL leaderboard.


In Table 5, we present the full results of the evaluation of the selected models on the Morfeusz-based datasets byName and byType. These results are provided for all available tasks that can be performed on the above-mentioned datasets. As the NKJP1M datasets contain no syntactic trees, it is impossible to test the dependency parsing task that relies on these trees and to measure UAS, LAS, CLAS, MLAS and BLEX.
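For readers unfamiliar with the attachment scores named above, the sketch below illustrates how UAS and LAS are typically computed. It is a simplified illustration rather than the official evaluation script: the real evaluation (as in the CoNLL 2018 shared task) first aligns system and gold segmentations, whereas here a one-to-one token alignment is assumed, and the helper name and toy data are made up.

```python
# Simplified UAS/LAS computation over already-aligned tokens.
# Each token is represented as a (head_index, deprel) tuple.

def uas_las(gold, predicted):
    """Return (UAS, LAS) for parallel lists of (head_index, deprel) tuples."""
    assert len(gold) == len(predicted), "tokens must be aligned one-to-one"
    n = len(gold) or 1
    uas = sum(g[0] == p[0] for g, p in zip(gold, predicted)) / n  # head only
    las = sum(g == p for g, p in zip(gold, predicted)) / n        # head + label
    return uas, las


# Toy example: the last token gets the right head but the wrong label,
# so UAS = 1.0 while LAS = 2/3.
gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl")]
print(uas_las(gold, pred))
```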


In Table 6, we present the results of the evaluation of the selected models on the UD-based datasets byName, byType, and PDB. This table contains the results of the segmentation, tagging, and lemmatization tasks. Table 7 continues Table 6 and contains the results for the same tagset and datasets on the dependency parsing task.



Table 5: Benchmark results for the Morfeusz tagset performed on two datasets: NKJP-byType (bT) and NKJP-byName (bN); AA – Aligned Accuracy; F1 – F1 score. Embeddings used in the models are: R – xlm-RoBERTa-base, fT – fastText, P – Polbert-base, pl – pl-core-news-lg, H – HerBERT.




Table 6: Benchmark results for the UD tagset performed on three datasets: NKJP-byType (bT), NKJP-byName (bN), and PDB-UD (PDB) for segmentation, tagging and lemmatization tasks; AA – Aligned Accuracy; F1 – F1 score. Embeddings used in the models are: R – xlm-RoBERTa-base, fT – fastText, P – Polbert-base, pl – pl-core-news-lg, H – HerBERT-base.




Table 7: Benchmark results for the UD tagset performed on three datasets: NKJP-byType (bT), NKJP-byName (bN), and PDB-UD (PDB) for the dependency parsing task; AA – Aligned Accuracy; F1 – F1 score. Embeddings used in the models are: R – xlm-RoBERTa-base, fT – fastText, P – Polbert-base, pl – pl-core-news-lg, H – HerBERT.


7. Acknowledgements

This work was supported by the European Regional Development Fund as a part of 2014–2020 Smart Growth Operational Programme, CLARIN — Common Language Resources and Technology Infrastructure (project no. POIR.04.02.00-00C002/19) and DARIAH-PL — Digital Research Infrastructure for the Arts and Humanities (project no. POIR.04.02.00-00-D006/20-0). We gratefully acknowledge Poland’s high-performance computing infrastructure PLGrid (HPC Centers: ACK Cyfronet AGH) for providing computer facilities and support within computational grant no. PLG/2022/015872.

8. Bibliographical References

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.


Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.


Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X), pages 149–164, New York City. Association for Computational Linguistics.


Kehai Chen, Tiejun Zhao, Muyun Yang, and Lemao Liu. 2017. Translation prediction with source dependency-based context representation. Proceedings of the AAAI Conference on Artificial Intelligence, 31(1).


Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.


Dick Crouch, Mary Dalrymple, Ronald M. Kaplan, Tracy Holloway King, John Maxwell, and Paula Newman. 2011. XLE Documentation. Palo Alto Research Center.


Marie-Catherine de Marneffe, Christopher D. Manning, Joakim Nivre, and Daniel Zeman. 2021. Universal Dependencies. Computational Linguistics, 47(2):255–308.


Sebastian Gehrmann, Tosin Adewumi, Karmanya Aggarwal, Pawan Sasanka Ammanamanchi, Anuoluwapo Aremu, Antoine Bosselut, Khyathi Raghavi Chandu, Miruna-Adriana Clinciu, Dipanjan Das, Kaustubh Dhole, Wanyu Du, Esin Durmus, Ondřej Dušek, Chris Chinenye Emezue, Varun Gangal, Cristina Garbacea, Tatsunori Hashimoto, Yufang Hou, Yacine Jernite, Harsh Jhamtani, Yangfeng Ji, Shailza Jolly, Mihir Kale, Dhruv Kumar, Faisal Ladhak, Aman Madaan, Mounica Maddela, Khyati Mahajan, Saad Mahamood, Bodhisattwa Prasad Majumder, Pedro Henrique Martins, Angelina McMillan-Major, Simon Mille, Emiel van Miltenburg, Moin Nadeem, Shashi Narayan, Vitaly Nikolaev, Andre Niyongabo Rubungo, Salomey Osei, Ankur Parikh, Laura Perez-Beltrachini, Niranjan Ramesh Rao, Vikas Raunak, Juan Diego Rodriguez, Sashank Santhanam, João Sedoc, Thibault Sellam, Samira Shaikh, Anastasia Shimorina, Marco Antonio Sobrevilla Cabezudo, Hendrik Strobelt, Nishant Subramani, Wei Xu, Diyi Yang, Akhila Yerukola, and Jiawei Zhou. 2021. The GEM benchmark: Natural language generation, its evaluation and metrics. In Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021), pages 96–120, Online. Association for Computational Linguistics.


Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5):602–610. IJCNN 2005.


Zhijiang Guo, Yan Zhang, and Wei Lu. 2019. Attention guided graph convolutional networks for relation extraction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 241–251, Florence, Italy. Association for Computational Linguistics.


Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. XTREME: A massively multilingual multitask benchmark for evaluating cross-lingual generalisation. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 4411–4421. PMLR.


Jungo Kasai, Dan Friedman, Robert Frank, Dragomir Radev, and Owen Rambow. 2019. Syntax-aware neural semantic role labeling with supertags. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 701–709, Minneapolis, Minnesota. Association for Computational Linguistics.


Daniel Khashabi, Tushar Khot, Ashish Sabharwal, and Dan Roth. 2018. Question answering as global reasoning over semantic abstractions. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1).


Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Zhengxuan Wu, Bertie Vidgen, Grusha Prasad, Amanpreet Singh, Pratik Ringshia, Zhiyi Ma, Tristan Thrush, Sebastian Riedel, Zeerak Waseem, Pontus Stenetorp, Robin Jia, Mohit Bansal, Christopher Potts, and Adina Williams. 2021. Dynabench: Rethinking benchmarking in NLP. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4110– 4124, Online. Association for Computational Linguistics.


Witold Kieraś and Marcin Woliński. 2017. Morfeusz 2 – analizator i generator fleksyjny dla języka polskiego. Język Polski, XCVII(1):75–83.


Mateusz Klimaszewski and Alina Wróblewska. 2021. COMBO: State-of-the-art morphosyntactic analysis. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 50–62, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.


Ryan McDonald, Koby Crammer, and Fernando Pereira. 2005. Online large-margin training of dependency parsers. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), pages 91–98, Ann Arbor, Michigan. Association for Computational Linguistics.


Ines Montani and Matthew Honnibal. 2022. spaCy: Industrial-Strength Natural Language Processing in Python. Version 3.4.1.


Robert Mroczkowski, Piotr Rybak, Alina Wróblewska, and Ireneusz Gawlik. 2021. HerBERT: Efficiently pretrained transformer-based language model for Polish. In Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing, pages 1–10, Kiyv, Ukraine. Association for Computational Linguistics.


Minh Van Nguyen, Viet Dac Lai, Amir Pouran Ben Veyseh, and Thien Huu Nguyen. 2021a. Trankit: A light-weight transformer-based toolkit for multilingual natural language processing. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 80–90, Online. Association for Computational Linguistics.


Minh Van Nguyen, Viet Dac Lai, Amir Pouran Ben Veyseh, and Thien Huu Nguyen. 2021b. Trankit: A light-weight transformer-based toolkit for multilingual natural language processing. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations.


Joakim Nivre. 2009. Non-projective dependency parsing in expected linear time. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 351–359, Suntec, Singapore. Association for Computational Linguistics.


Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback.


Adrien Pavao, Isabelle Guyon, Anne-Catherine Letournel, Xavier Baró, Hugo Escalante, Sergio Escalera, Tyler Thomas, and Zhen Xu. 2022. Codalab competitions: An open source platform to organize scientific challenges. Technical report, Université Paris-Saclay.


Jonas Pfeiffer, Andreas Rücklé, Clifton Poth, Aishwarya Kamath, Ivan Vulić, Sebastian Ruder, Kyunghyun Cho, and Iryna Gurevych. 2020a. AdapterHub: A framework for adapting transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 46–54, Online. Association for Computational Linguistics.


Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. 2020b. MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7654–7673, Online. Association for Computational Linguistics.


Adam Przepiórkowski, Mirosław Bańko, Rafał L. Górski, and Barbara Lewandowska-Tomaszczyk, editors. 2012. Narodowy Korpus Języka Polskiego. Wydawnictwo Naukowe PWN, Warsaw.


Piotr Przybyła. 2022. LAMBO: Layered Approach to Multi-level BOundary identification.


Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 101–108, Online. Association for Computational Linguistics.


Piotr Rybak and Alina Wróblewska. 2018. Semi-supervised neural system for tagging, parsing and lemmatization. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 45–54, Brussels, Belgium. Association for Computational Linguistics.


Devendra Sachan, Yuhao Zhang, Peng Qi, and William L. Hamilton. 2021. Do syntax trees help pre-trained transformers extract information? In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2647– 2661, Online. Association for Computational Linguistics.


Djamé Seddah, Reut Tsarfaty, Sandra Kübler, Marie Candito, Jinho D. Choi, Richárd Farkas, Jennifer Foster, Iakes Goenaga, Koldo Gojenola Galletebeitia, Yoav Goldberg, Spence Green, Nizar Habash, Marco Kuhlmann, Wolfgang Maier, Joakim Nivre, Adam Przepiórkowski, Ryan Roth, Wolfgang Seeker, Yannick Versley, Veronika Vincze, Marcin Woliński, Alina Wróblewska, and Eric Villemonte de la Clergerie. 2013. Overview of the SPMRL 2013 shared task: A cross-framework evaluation of parsing morphologically rich languages. In Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically-Rich Languages, pages 146– 182, Seattle, Washington, USA. Association for Computational Linguistics.


Milan Straka, Jan Hajič, and Jana Straková. 2016. UDPipe: Trainable pipeline for processing CoNLL-U files performing tokenization, morphological analysis, POS tagging and parsing. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 4290–4297, Portorož, Slovenia. European Language Resources Association (ELRA).


Milan Straka and Jana Straková. 2017. Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 88–99, Vancouver, Canada. Association for Computational Linguistics.


Kai Sun, Richong Zhang, Samuel Mensah, Yongyi Mao, and Xudong Liu. 2019. Aspect-level sentiment analysis via convolution over dependency tree. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5679–5688, Hong Kong, China. Association for Computational Linguistics.


Łukasz Szałkiewicz and Adam Przepiórkowski. 2012. Anotacja morfoskładniowa. In Adam Przepiórkowski, Mirosław Bańko, Rafał L. Górski, and Barbara Lewandowska-Tomaszczyk, editors, Narodowy Korpus Języka Polskiego, pages 59– 96. Wydawnictwo Naukowe PWN, Warsaw.


Shikhar Vashishth, Rishabh Joshi, Sai Suman Prayaga, Chiranjib Bhattacharyya, and Partha Talukdar. 2018. RESIDE: Improving distantlysupervised neural relation extraction using side information. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1257–1266, Brussels, Belgium. Association for Computational Linguistics.


Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.


Yufei Wang, Mark Johnson, Stephen Wan, Yifang Sun, and Wei Wang. 2019. How to best use syntax in semantic role labelling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5338–5343, Florence, Italy. Association for Computational Linguistics.


Jakub Waszczuk. 2012. Harnessing the CRF complexity with domain-specific constraints. The case of morphosyntactic tagging of a highly inflected language. In Proceedings of COLING 2012, pages 2789–2804.


Jakub Waszczuk, Witold Kieraś, and Marcin Woliński. 2018. Morphosyntactic disambiguation and segmentation for historical Polish with graph-based conditional random fields. In International Conference on Text, Speech, and Dialogue, pages 188–196. Springer.


Marcin Woliński. 2014. Morfeusz reloaded. In Proceedings of the Ninth International Conference on Language Resources and Evaluation, LREC 2014, pages 1106–1111. European Language Resources Association (ELRA).


Marcin Woliński. 2019. Automatyczna analiza składnikowa języka polskiego. Wydawnictwa Uniwersytetu Warszawskiego, Warsaw.


Daniel Zeman, Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre, and Slav Petrov. 2018. CoNLL 2018 shared task: Multilingual parsing from raw text to Universal Dependencies. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1–21, Brussels, Belgium. Association for Computational Linguistics.


Daniel Zeman, Martin Popel, Milan Straka, Jan Hajič, Joakim Nivre, Filip Ginter, Juhani Luotolahti, Sampo Pyysalo, Slav Petrov, Martin Potthast, Francis Tyers, Elena Badmaeva, Memduh Gokirmak, Anna Nedoluzhko, Silvie Cinková, Jan Hajič jr., Jaroslava Hlaváčová, Václava Kettnerová, Zdeňka Urešová, Jenna Kanerva, Stina Ojala, Anna Missilä, Christopher D. Manning, Sebastian Schuster, Siva Reddy, Dima Taji, Nizar Habash, Herman Leung, Marie-Catherine de Marneffe, Manuela Sanguinetti, Maria Simi, Hiroshi Kanayama, Valeria de Paiva, Kira Droganova, Héctor Martínez Alonso, Çağrı Çöltekin, Umut Sulubacak, Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Georg Rehm, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Michael Mandl, Jesse Kirchner, Hector Fernandez Alcalde, Jana Strnadová, Esha Banerjee, Ruli Manurung, Antonio Stella, Atsuko Shimada, Sookyoung Kwak, Gustavo Mendonça, Tatiana Lando, Rattima Nitisaroj, and Josie Li. 2017. CoNLL 2017 shared task: Multilingual parsing from raw text to Universal Dependencies. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1–19, Vancouver, Canada. Association for Computational Linguistics.


Meishan Zhang, Zhenghua Li, Guohong Fu, and Min Zhang. 2019. Syntax-enhanced neural machine translation with syntax-aware word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1151–1161, Minneapolis, Minnesota. Association for Computational Linguistics.


Yuhao Zhang, Peng Qi, and Christopher D. Manning. 2018. Graph convolution over pruned dependency trees improves relation extraction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2205–2215, Brussels, Belgium. Association for Computational Linguistics.

9. Language Resource References

Alexis Conneau and Kartikay Khandelwal and Naman Goyal and Vishrav Chaudhary and Guillaume Wenzek and Francisco Guzmán and Edouard Grave and Myle Ott and Luke Zettlemoyer and Veselin Stoyanov. 2019. XLM-RoBERTa. Hugging Face.


Grave, Edouard and Bojanowski, Piotr and Gupta, Prakhar and Joulin, Armand and Mikolov, Tomas. 2018. fastText. Facebook.


Kłeczek, Dariusz. 2021. Polbert. Hugging Face.


Lynn, Teresa and Foster, Jennifer and McGuinness, Sarah and Walsh, Abigail and Phelan, Jason and Scannell, Kevin. 2015. Irish Dependency Treebank (UD Irish-IDT). Universal Dependencies Consortium. PID http://hdl.handle.net/11234/1-4611.


Mroczkowski, Robert and Rybak, Piotr and Wróblewska, Alina and Gawlik, Ireneusz. 2021. HerBERT. Hugging Face.


Przepiórkowski, Adam and Bańko, Mirosław and Górski, Rafał L. and Lewandowska-Tomaszczyk, Barbara. 2018. National Corpus of Polish. Institute of Computer Science.


Shen, Mo and McDonald, Ryan and Zeman, Daniel and Qi, Peng. 2019. Chinese Dependency Treebank (UD Chinese-GSD). Universal Dependencies Consortium. PID http://hdl.handle.net/11234/1-4611.


Wróblewska, Alina. 2018. Polish Dependency Bank (UD Polish-PDB). Universal Dependencies Consortium. PID http://hdl.handle.net/11234/1-5150.


This paper is available on arXiv under CC BY-NC-SA 4.0 DEED license.