Domain and Task
Related Work
3.1. Text mining and NLP research overview
3.2. Text mining and NLP in industry use
4.6. XML parsing, data joining, and risk indices development
Experiment and Demonstration
Discussion
6.1. The ‘industry’ focus of the project
6.2. Data heterogeneity, multilingual and multi-task nature
Figure 5 presents an overview of our workflow. As mentioned before, this article focuses on extracting the structured lot and item information that is often missing from tender and award XMLs (the middle lane of the figure). This is covered in Sections 4.1 to 4.5. In Section 4.6, we briefly cover the other parts of the workflow.
Given a collection of tender attachment documents associated with one tender notice, our first step (content extraction) uses several data extraction libraries to convert the different content formats into a single, universal data structure called ‘Vamstar Universal Documents (VUD)’, which represents text content in JSON format. In ‘lot zoning’, we use passage retrieval/selection techniques to identify the content areas (pages and tables) that potentially contain useful lot information. Next, ‘lot item detection’ uses text classification techniques to process the extracted text passages and identify content (sentences and table rows) that describes a lot and its items. Following this, ‘lot parsing’ applies rule-based NER to the texts related to a lot and its individual items, identifying specific attributes and creating a structured representation of the lot. For most of these processes, we use domain knowledge in a way that generalises across multiple languages.
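To make the order of the stages concrete, the following is a minimal sketch of how lot zoning, lot item detection, and lot parsing could be chained once content extraction has produced text passages. It is an illustration under stated assumptions, not the implementation described in this article: the `Passage` class is a hypothetical, much simplified stand-in for a VUD (whose schema is not given here), keyword matching stands in for passage retrieval, a regular expression stands in for the trained text classifier, and the two attribute patterns are invented examples of rule-based extraction.

```python
import re
from dataclasses import dataclass


# Hypothetical, simplified stand-in for one unit of a VUD.
# The real VUD schema is not described in this section.
@dataclass
class Passage:
    page: int
    kind: str  # "text" or "table_row"
    text: str


# Illustrative keywords only; a real system would use passage retrieval.
LOT_KEYWORDS = ("lot", "item", "quantity")

# Illustrative rules only; these are assumptions, not the authors' rules.
LOT_LINE_RE = re.compile(r"\b(?:lot|item)\s+\d+", re.IGNORECASE)
LOT_RE = re.compile(r"\blot\s+(?P<lot>\d+)\b", re.IGNORECASE)
QTY_RE = re.compile(r"\b(?:quantity|qty)[:\s]+(?P<qty>\d+)\b", re.IGNORECASE)


def lot_zoning(passages):
    """Keep passages mentioning a lot-related keyword (stand-in for passage selection)."""
    return [p for p in passages if any(k in p.text.lower() for k in LOT_KEYWORDS)]


def lot_item_detection(passages):
    """Keep passages that look like a lot or item line (stand-in for a text classifier)."""
    return [p for p in passages if LOT_LINE_RE.search(p.text)]


def lot_parsing(passages):
    """Extract a lot number and, if present, a quantity with simple rules."""
    records = []
    for p in passages:
        lot = LOT_RE.search(p.text)
        if lot is None:
            continue
        qty = QTY_RE.search(p.text)
        records.append({
            "lot": int(lot.group("lot")),
            "quantity": int(qty.group("qty")) if qty else None,
            "description": p.text,
            "page": p.page,
        })
    return records


def run_pipeline(passages):
    """Chain the three stages in the order described in the text."""
    return lot_parsing(lot_item_detection(lot_zoning(passages)))


passages = [
    Passage(1, "text", "General terms and conditions of the tender."),
    Passage(3, "table_row", "Lot 2: surgical gloves, quantity: 5000"),
    Passage(4, "text", "The lot structure is described in the annex."),
]
result = run_pipeline(passages)
```

In this toy input, the first passage is dropped at the zoning stage, the third passes zoning but is dropped at detection because it names no specific lot, and only the table row reaches parsing. Each stage narrows the input for the next, which is the main design point the sketch is meant to show.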
Meanwhile, other structured information is extracted from tender and award XMLs by simple XML parsing. This extracted information is then joined to form supplier-centric contract award records, which are used to populate our database. The database is then used to calculate supplier risk indices, which together form a supplier risk profile. In the following sections, we explain each component. However, details of certain components may need to be redacted due to NDAs on proprietary content.
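The parse-and-join step can be illustrated with a short sketch. Everything specific in it is an assumption made for illustration: the XML element and attribute names are invented rather than taken from any real tender or award schema, the `lots` dictionary stands in for the output of the lot extraction pipeline, and `award_summary` computes a placeholder aggregate (award count and total value per supplier), not the risk indices used in this work, which are not specified in this section.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical award XML; element and attribute names are invented.
AWARD_XML = """
<awards>
  <award tender_id="T1" lot="1"><supplier>Acme Medical</supplier><value>1000</value></award>
  <award tender_id="T1" lot="2"><supplier>Beta Pharma</supplier><value>250</value></award>
  <award tender_id="T2" lot="1"><supplier>Acme Medical</supplier><value>400</value></award>
</awards>
"""


def parse_awards(xml_text):
    """Read each <award> element into a flat dictionary."""
    root = ET.fromstring(xml_text.strip())
    return [
        {
            "tender_id": a.get("tender_id"),
            "lot": int(a.get("lot")),
            "supplier": a.findtext("supplier"),
            "value": float(a.findtext("value")),
        }
        for a in root.iter("award")
    ]


def join_records(awards, lots):
    """Attach extracted lot descriptions to awards and group them by supplier.

    `lots` maps (tender_id, lot number) to a description; awards with no
    matching lot keep a description of None.
    """
    by_supplier = defaultdict(list)
    for a in awards:
        record = dict(a, description=lots.get((a["tender_id"], a["lot"])))
        by_supplier[a["supplier"]].append(record)
    return dict(by_supplier)


def award_summary(by_supplier):
    """Placeholder aggregate per supplier; NOT the risk indices of this work."""
    return {
        supplier: {
            "awards": len(records),
            "total_value": sum(r["value"] for r in records),
        }
        for supplier, records in by_supplier.items()
    }


# Stand-in for lot information produced by the extraction pipeline.
lots = {("T1", 1): "surgical gloves", ("T1", 2): "syringes"}
records = join_records(parse_awards(AWARD_XML), lots)
summary = award_summary(records)
```

Keying the join on the pair of tender identifier and lot number keeps awards for the same lot number in different tenders apart, and awards without extracted lot details are retained rather than dropped.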
Authors:
(1) Ziqi Zhang*, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP ([email protected]);
(2) Tomas Jasaitis, Vamstar Ltd., London ([email protected]);
(3) Richard Freeman, Vamstar Ltd., London ([email protected]);
(4) Rowida Alfrjani, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP ([email protected]);
(5) Adam Funk, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP ([email protected]).
This paper is