Text mining and NLP have been established research fields for decades. Their techniques have also been widely adopted in industry to develop and deploy intelligent systems for automated analysis of large-scale text data. However, the literature is dominated by research that favours supervised methods built on well-curated data. Solutions developed in such a ‘lab environment’ often do not transfer well to practical scenarios. In contrast, studies reporting industrial text mining/NLP tasks often make use of rule-based methods and domain lexicons. But they typically look at single and sometimes simplified tasks, and do not discuss in depth data heterogeneity and inconsistency and their implications for the development of their methods. Further, little prior work has focused on the healthcare domain.
Set in this context, our work describes an industry project that developed text mining methods and solutions to mine millions of heterogeneous, multilingual procurement documents in the healthcare sector. We extract structured procurement contract data and store them in a database that drives a platform enabling easy evaluation of supplier risks. Our work sets a reference for future research and practice in several ways: 1) it develops the first structured procurement contract database that will help facilitate the tendering process; 2) it documents a method that effectively uses domain knowledge and generalises to multiple text mining and NLP tasks and languages; and 3) it discusses lessons learned for practical text mining/NLP development.
Drawing on these lessons, we make a few recommendations for researchers and practitioners. First, we argue that research needs to ‘step out of the lab environment’ by using data that better reflect reality. Research data are typically well-curated and pre-processed. But as we have seen, real data in practice are rarely of good quality and are often highly inconsistent. This means that practitioners often need to make a significant effort to cleanse their data, or to adapt state-of-the-art methods from research. Both are non-trivial. Second, rules continue to be important and effective in many real-world applications, as they are easy to implement, fit for purpose, and easy to interpret. We believe an interesting direction is for model explainability research to develop methods that can explain model decisions in terms of rules, going beyond the current ‘primitive’ approaches (e.g., feature weights, attentions). These may offer valuable insights for building domain-specific applications. Third, we recommend that practitioners focus on their real needs when it comes to algorithmic choices. While recent text mining and NLP research has seen deep neural networks, especially very large language models trained on massive corpora, take centre stage, their added value to businesses in practice may depend on the domain and task. This is particularly important if the business has restricted access to resources, as these methods are much more resource-intensive than classic machine learning models. Finally, building industrial text mining and NLP applications usually entails a process involving multiple tasks. While there are often tried-and-tested methods for each task, practitioners again need to consider their resource constraints. It helps to think in terms of building solutions that can generalise to a wide range of tasks, instead of buying or adapting ad-hoc solutions for each task.
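To make the point about rules being easy to implement and interpret more concrete, the following is a minimal sketch of a lexicon-driven extraction rule. It is not the method developed in this project: the unit lexicon `DOSE_UNITS`, the rule name `dose`, and the function `extract_doses` are hypothetical and chosen purely for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical mini-lexicon of dosage units (an assumption, not project data).
DOSE_UNITS = ("mg", "g", "ml", "mcg")

# A single rule: a number followed by a unit from the lexicon, e.g. "500 mg".
DOSE_RULE = re.compile(
    r"\b(\d+(?:\.\d+)?)\s*(" + "|".join(DOSE_UNITS) + r")\b",
    re.IGNORECASE,
)


@dataclass
class RuleMatch:
    rule: str   # name of the rule that fired
    text: str   # matched surface text
    start: int  # character offset where the match starts
    end: int    # character offset where the match ends


def extract_doses(text: str) -> list[RuleMatch]:
    """Return every dosage mention in `text`, with the rule and span that produced it."""
    return [
        RuleMatch("dose", m.group(0), m.start(), m.end())
        for m in DOSE_RULE.finditer(text)
    ]


if __name__ == "__main__":
    for match in extract_doses("Paracetamol 500 mg tablets and 100 ml syrup"):
        print(match)
```

Each result records which rule fired and the exact character span it matched, so an extraction can be traced back to a single, readable pattern. This is the kind of transparency that is harder to obtain from feature weights or attention scores.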
Our work has a number of limitations. First, we have not evaluated the end system, i.e., the platform for deriving supplier risk profiles. This is primarily because the work is being developed further by the industry partner before being presented to end users. An end-user evaluation would be an extremely valuable exercise to examine the effectiveness of our text mining and NLP methods. Second, our work has focused on a specific sub-area of healthcare: pharmaceuticals. This is arguably an easier sub-area than medical equipment, where naming and standards can be very inconsistent. Therefore, it is difficult to conclude how well our methods would generalise to such areas.
In terms of future work, we identify three main directions. First, we will look at adapting our solution to other areas of the healthcare sector (e.g., medical equipment as mentioned above), or to other sectors. Second, while we only analysed procurement documents within the project, another source of useful information is supplier websites and their product catalogues. We envisage mining such data in the future to enrich our database. Finally, we recognise a lack of research in the area of procurement text mining and NLP. For this reason, we plan to release part of our data (subject to further processing to redact sensitive information) for use by the research community, and to set up a shared task to encourage effort in this direction.
Part of this work was funded by Innovate UK under the project 90205 ‘AI-powered real-time healthcare supplier profile and COVID-19 supply risk matrix’.
Authors:
(1) Ziqi Zhang*, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP;
(2) Tomas Jasaitis, Vamstar Ltd., London;
(3) Richard Freeman, Vamstar Ltd., London;
(4) Rowida Alfrjani, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP;
(5) Adam Funk, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP.
This paper is