Table of Links

Abstract and Introduction


Domain and Task
2.1. Data sources and complexity
2.2. Task definition


Related Work
3.1. Text mining and NLP research overview
3.2. Text mining and NLP in industry use
3.3. Text mining and NLP for procurement
3.4. Conclusion from literature review


Proposed Methodology
4.1. Domain knowledge
4.2. Content extraction
4.3. Lot zoning
4.4. Lot item detection
4.5. Lot parsing
4.6. XML parsing, data joining, and risk indices development


Experiment and Demonstration
5.1. Component evaluation
5.2. System demonstration


Discussion
6.1. The ‘industry’ focus of the project
6.2. Data heterogeneity, multilingual and multi-task nature
6.3. The dilemma of algorithmic choices
6.4. The cost of training data


Conclusion, Acknowledgements, and References

3.2. Text mining and NLP in industry use

Text mining and NLP are already widely used in a number of industry contexts. Due to limited space, we look only briefly at a few domains, and focus on work for real applications rather than general-purpose tasks such as building domain corpora or language models, or fundamental NLP research adapted to domain-specific data.

In the legal domain, Zhong et al. (2020) summarised three main application areas: judicial decision prediction, similar case matching, and question answering (QA). Judicial decision prediction studies the problem of determining the verdict of a court from textual information about a case that was available before the verdict was made. This is typically treated as a text classification task, where court decisions are categorised and predicted based on features extracted from the case texts. For details, we refer readers to a survey by Francia et al. (2022). Similar case matching is an important application where judicial decisions are made according to similar and representative cases in the past. The goal is to find pairs of similar cases. This is often cast as a retrieval task, and is being extensively studied in Legal IR (Xiao et al., 2019). As with judicial decision prediction, existing methods mainly focus on comparing case texts at word or semantic level. In both areas, recent studies have argued for extracting more fine-grained ‘case element’ information (e.g., using NER) that may better represent legal cases. For example, Hu et al. (2018) extracted ‘legal attributes’ that define the nature of cases in judicial decision prediction. Question answering is a rather complicated NLP task that builds on many low-level tasks, such as finding relevant text passages (PR and text classification), locating case elements (NER and relation extraction), and text similarity matching. QA is also not related to our work. For both reasons, we do not go into detail about legal QA. In general, questions concern the explanation of legal concepts or case analysis.

Legal documents are often lengthy, but they are well structured into different parts and follow standard structures that make them easier to process. Over the years, the research community in this domain has defined granular tasks and standards, and created rich resources including training data. In contrast, our work deals with much more heterogeneous data and inconsistent structures, where creating training data for some tasks is expensive. Most established methods in this domain are therefore not directly transferable to our task.

With the growth of e-commerce and social media, mining Web content related to service/product provisions has been extensively studied and applied to develop competitive advantage for businesses. A major area of application is analysing reviews from e-commerce or social media platforms to acquire business intelligence that informs various practices. For example, by analysing customer reviews about their own services or products, businesses can gain insights into customer satisfaction, to inform public relations management (Nave et al., 2018) and product development (Yang et al., 2019). Analysing such data about a whole sector, including multiple competitors, allows the discovery of market trends and even the forecasting of demand (Chatterjee, 2019; Sharma et al., 2020). Further, as platforms for user-generated content grow, the diffusion of fake information becomes easier and is an increasing concern. Therefore, work has been done to automatically detect and filter such misinformation (Fang et al., 2020). A thorough review of work in the above areas can be found in Kuma et al. (2021).

Work in these areas relies heavily on sentiment analysis, which is a type of text classification task aiming to analyse the sentiment of a text.
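
To make the framing of sentiment analysis as text classification concrete, the following is a minimal sketch only. The toy reviews, the function names (`tokenize`, `train`, `predict`), and the choice of a multinomial Naive Bayes model with add-one smoothing are all illustrative assumptions; none of them is taken from the studies cited above.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training set: (review text, sentiment label).
TRAIN = [
    ("great product, fast delivery", "positive"),
    ("excellent service and great quality", "positive"),
    ("terrible quality, very slow delivery", "negative"),
    ("poor service, broken product", "negative"),
]


def tokenize(text):
    # Deliberately crude tokenizer: lowercase, keep alphabetic runs only.
    cleaned = "".join(c.lower() if c.isalpha() else " " for c in text)
    return cleaned.split()


def train(examples):
    # Count documents per label and word occurrences per label.
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return label_counts, word_counts, vocab


def predict(model, text):
    # Pick the label with the highest log posterior under Naive Bayes.
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing over the vocabulary.
            score += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAIN)
```

With this toy data, `predict(model, "great service")` returns `"positive"` and `predict(model, "slow and broken")` returns `"negative"`. Real review-mining systems would use far larger labelled corpora and stronger models; the sketch only shows the classification framing.
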
It is also very useful to extract key elements related to service/product provision (e.g., product features, service processes) for more fine-grained analysis, and this can benefit from NER methods. Many studies also use topic modelling and social network analysis, which are beyond the scope of this work. Compared to the domain in our work, data studied in these areas are rather homogeneous: they are typically free-form review texts that are independent and self-contained, making it easy to adapt state-of-the-art methods. Partly for this reason, there is an abundance of tools developed in this area. However, as explained before, the data we have to deal with is much more complex.

Text mining and NLP are also widely used in system requirements analysis and quality control in construction. To name a few, Li et al. (2015) processed software user manuals (PDFs) to extract functional and non-functional requirements. A dictionary of keywords combined with rules is used to extract sentences, and topic modelling is applied to cluster similar sentences into groups, which may represent the same or similar requirements. The work also needed to filter irrelevant passages from the text extracted from PDFs, but this was done manually. Our work is similar in that some of our methods make use of keywords and rules, and also process PDFs. However, we need to deal with a much larger and more heterogeneous collection, and the filtering of content must be done automatically. Tiun et al. (2020) developed supervised classifiers of functional and non-functional system requirements using well-curated training sentences obtained from Siemens Logistics and Automotive Organization. Khan et al. (2020) analysed Reddit posts about Google Maps, and trained a text classifier to discover posts related to software feature requirements or issues using a set of manually annotated posts. Both deal with higher-quality data and have access to a large amount of high-quality training data, while our project only managed to create limited training data for some but not all tasks due to resource constraints.

Lee et al. (2014) applied keyword extraction and co-occurrence analysis to marine structure quality inspection reports. The co-occurrence map is used to help humans identify which quality aspects are most frequently mentioned and in what ways. Similar methods are used in Tian et al. (2021), who processed construction project reports to extract keywords and concepts related to several key aspects of project management (e.g., quality control, safety management). Zhang et al. (2019) developed a text classifier to classify accident reports from construction projects, and implemented a rule-based extractor to identify the causes of accidents. The data used in these studies are also rather simple compared to ours. For example, the report cases used by Zhang et al. (2019) are composed of short paragraphs and use a consistent structure.

Authors:
(1) Ziqi Zhang*, Information School, the University of Sheffield, Regent Court, Sheffield, UK S1 4DP (Ziqi.Zhang@sheffield.ac.uk);
(2) Tomas Jasaitis, Vamstar Ltd., London (Tomas.Jasaitis@vamstar.io);
(3) Richard Freeman, Vamstar Ltd., London (Richard.Freeman@vamstar.io);
(4) Rowida Alfrjani, Information School, the University of Sheffield, Regent Court, Sheffield, UK S1 4DP (Rowida.Alfrjani@sheffield.ac.uk);
(5) Adam Funk, Information School, the University of Sheffield, Regent Court, Sheffield, UK S1 4DP (Adam.Funk@sheffield.ac.uk).

This paper is available on arXiv under CC BY 4.0 license.

Part of HackerNoon's growing list of open-source research papers, promoting free access to academic material.
