Table of Links

Abstract and Introduction
Domain and Task
  2.1. Data sources and complexity
  2.2. Task definition
Related Work
  3.1. Text mining and NLP research overview
  3.2. Text mining and NLP in industry use
  3.3. Text mining and NLP for procurement
  3.4. Conclusion from literature review
Proposed Methodology
  4.1. Domain knowledge
  4.2. Content extraction
  4.3. Lot zoning
  4.4. Lot item detection
  4.5. Lot parsing
  4.6. XML parsing, data joining, and risk indices development
Experiment and Demonstration
  5.1. Component evaluation
  5.2. System demonstration
Discussion
  6.1. The ‘industry’ focus of the project
  6.2. Data heterogeneity, multilingual and multi-task nature
  6.3. The dilemma of algorithmic choices
  6.4. The cost of training data
Conclusion, Acknowledgements, and References

6.2. Data heterogeneity, multilingual and multi-task nature

As noted in earlier sections, compared with the research and industry text mining and NLP practices reported in the literature, the data we dealt with are multilingual and heterogeneous, such that the goal of the project cannot be achieved by a single type of method but instead draws on multiple subfields of text mining and NLP research. While there may be well-established solutions in each of these subfields, as discussed before, the differences in the raw data analysed, the multilingual nature of the data, and the lack of training data in our project make adopting these methods very difficult and time-consuming.

Considering the wider principle mentioned above, another lesson we learned is that, given such a high level of complexity in our tasks, it is imperative to develop ‘lightweight’ methods that are easy to maintain, or ‘generalised’ solutions that could apply to multiple tasks, or both. On reflection, we opted for treating many tasks as text classification, using a feature representation scheme that is arguably language independent and that applies to, or can be easily adapted to, all of the tasks.
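To make the idea of a language-independent feature representation concrete, the following is a minimal sketch rather than the implementation used in the project. It assumes that features are simple counts of matches against domain lexicons, that each language has its own lexicon, and that all lexicons map terms to one shared set of categories. The category names, the lexicon entries, and the function name `lexicon_features` are all invented for illustration.

```python
import re
from collections import Counter

# Hypothetical shared categories; the real lexicon categories are not given here.
CATEGORIES = ["product", "quantity_unit", "lot_marker"]

# Hypothetical per-language lexicons mapping a surface term to a shared category.
LEXICONS = {
    "en": {"syringe": "product", "gloves": "product",
           "pieces": "quantity_unit", "box": "quantity_unit",
           "lot": "lot_marker"},
    "de": {"spritze": "product", "handschuhe": "product",
           "stück": "quantity_unit", "packung": "quantity_unit",
           "los": "lot_marker"},
}

TOKEN = re.compile(r"\w+")  # \w is Unicode-aware for str patterns in Python 3


def lexicon_features(text, lang):
    """Count lexicon hits per category.

    The vector layout depends only on CATEGORIES, so text in any language
    with a lexicon is mapped into the same feature space.
    """
    lexicon = LEXICONS[lang]
    tokens = TOKEN.findall(text.lower())
    counts = Counter(lexicon[tok] for tok in tokens if tok in lexicon)
    return [counts[cat] for cat in CATEGORIES]


# The same function serves both granularities: a single line (as in lot item
# detection) or a larger block of text (as in lot zoning).
line_en = lexicon_features("Lot 1: 200 pieces of gloves", "en")
line_de = lexicon_features("Los 1: 200 Stück Handschuhe", "de")
zone_en = lexicon_features("Lot 2\n50 box syringe\n10 box gloves", "en")
```

Because the vectors share one layout, a single classifier could in principle be trained on them regardless of the source language, and revising a lexicon entry changes the features without touching the classifier code.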
Specifically, two components of our pipeline (lot zoning and lot item detection) deal with text classification at different granularities, and therefore the machine learning algorithms can be easily reused across them. The feature representations are derived with reference to domain lexicons, which can be more easily and affordably translated into multiple languages, particularly considering the massive number of multilingual documents we have to analyse and the fact that only a fraction of them contain content that is really useful for analysis. Another benefit of our approach is that it enables domain experts and system administrators to update the trained models by simply revising the domain lexicons during re-training.

To the best of our knowledge, there is no prior literature that describes in detail a holistic text mining or NLP process composed of multiple sub-tasks, or develops generalisable solutions like ours. Therefore, our work sets an important reference for future industry projects dealing with complex tasks.

Authors:

(1) Ziqi Zhang*, Information School, the University of Sheffield, Regent Court, Sheffield, UK S1 4DP (Ziqi.Zhang@sheffield.ac.uk);
(2) Tomas Jasaitis, Vamstar Ltd., London (Tomas.Jasaitis@vamstar.io);
(3) Richard Freeman, Vamstar Ltd., London (Richard.Freeman@vamstar.io);
(4) Rowida Alfrjani, Information School, the University of Sheffield, Regent Court, Sheffield, UK S1 4DP (Rowida.Alfrjani@sheffield.ac.uk);
(5) Adam Funk, Information School, the University of Sheffield, Regent Court, Sheffield, UK S1 4DP (Adam.Funk@sheffield.ac.uk).

This paper is available on arXiv under CC BY 4.0 license.