Table of Links

Abstract and Introduction
Domain and Task
2.1. Data sources and complexity
2.2. Task definition
Related Work
3.1. Text mining and NLP research overview
3.2. Text mining and NLP in industry use
3.3. Text mining and NLP for procurement
3.4. Conclusion from literature review
Proposed Methodology
4.1. Domain knowledge
4.2. Content extraction
4.3. Lot zoning
4.4. Lot item detection
4.5. Lot parsing
4.6. XML parsing, data joining, and risk indices development
Experiment and Demonstration
5.1. Component evaluation
5.2. System demonstration
Discussion
6.1. The ‘industry’ focus of the project
6.2. Data heterogeneity, multilingual and multi-task nature
6.3. The dilemma of algorithmic choices
6.4. The cost of training data
Conclusion, Acknowledgements, and References

4.2. Content extraction

In this component, our goal is to convert heterogeneous data file formats (Word, Excel, PDF, etc.) into a universal, machine-accessible JSON format, which we refer to as VUD. For each file, depending on its format, we use the corresponding APIs (e.g., Apache Tika for Word and Excel files, Tesseract and PDFPlumber for PDF). However, not all APIs or data files support the extraction of formatting features (e.g., font size, colour, header level), especially if a document is not structurally tagged. Therefore, we focus on extracting only the text content and a limited range of structural information. Our VUD documents contain the following structured content:

● Pages, corresponding to the pages in the source document;

● Paragraphs, identified as consecutive blocks of text separated by at least two newline characters.
For example, in Figure 1, ‘North Macedonia-Strumica: Pharmaceutical products’ and ‘2020/S 050-119757’ in the title will be treated as separate paragraphs, while ‘Legal Basis: Directive 2014/24/EU’ is a single paragraph;

● Tables, extracted using the Camelot and Tabula libraries, contain basic tabular structure including column and row indices (with column- and row-span information) and the textual content within each cell. Tables spanning multiple pages are initially recognised as separate tables, but are then merged if they meet the following simple rules: 1) there are no other non-tabular structures between the two tables; 2) they have the same number of columns.

Authors:

(1) Ziqi Zhang*, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP (Ziqi.Zhang@sheffield.ac.uk);

(2) Tomas Jasaitis, Vamstar Ltd., London (Tomas.Jasaitis@vamstar.io);

(3) Richard Freeman, Vamstar Ltd., London (Richard.Freeman@vamstar.io);

(4) Rowida Alfrjani, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP (Rowida.Alfrjani@sheffield.ac.uk);

(5) Adam Funk, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP (Adam.Funk@sheffield.ac.uk).

This paper is available on arxiv under CC BY 4.0 license.
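The paragraph-splitting and cross-page table-merging rules described in Section 4.2 can be sketched as follows. This is a minimal illustration only: the block dictionaries and field names (`type`, `columns`, `rows`) are hypothetical stand-ins, not the authors' actual VUD schema.

```python
import re

def split_paragraphs(text):
    """Split extracted text into paragraphs: a paragraph is a
    consecutive block of text separated by at least two newlines."""
    return [p.strip() for p in re.split(r"\n{2,}", text) if p.strip()]

def merge_cross_page_tables(blocks):
    """Merge tables that were split across pages.

    `blocks` is the document's content in reading order. Two
    consecutive tables are merged when (1) no non-tabular block sits
    between them and (2) they have the same number of columns.
    """
    merged = []
    for block in blocks:
        if (block["type"] == "table" and merged
                and merged[-1]["type"] == "table"
                and merged[-1]["columns"] == block["columns"]):
            # Same column count, nothing in between: append rows.
            merged[-1]["rows"].extend(block["rows"])
        else:
            # Copy so that merging never mutates the caller's input.
            merged.append(dict(block, rows=list(block.get("rows", []))))
    return merged
```

For instance, a three-column table ending one page and a three-column table opening the next, with no text block between them, collapse into a single table; if an intervening paragraph (or a different column count) is present, they stay separate.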