Table of Links

Abstract and Introduction


Domain and Task
2.1. Data sources and complexity
2.2. Task definition


Related Work
3.1. Text mining and NLP research overview
3.2. Text mining and NLP in industry use
3.3. Text mining and NLP for procurement
3.4. Conclusion from literature review


Proposed Methodology
4.1. Domain knowledge
4.2. Content extraction
4.3. Lot zoning
4.4. Lot item detection
4.5. Lot parsing
4.6. XML parsing, data joining, and risk indices development


Experiment and Demonstration
5.1. Component evaluation
5.2. System demonstration


Discussion
6.1. The ‘industry’ focus of the project
6.2. Data heterogeneity, multilingual and multi-task nature
6.3. The dilemma of algorithmic choices
6.4. The cost of training data


Conclusion, Acknowledgements, and References

6. Discussion

In this section, we discuss lessons learned from the project that may inform future research and practice. We cover four angles: how the focus of an industry project differs from that of research; the complexity of building a full NLP pipeline for heterogeneous data and its implications for development; the dilemma of choosing between more advanced NLP methods and earlier, classic ones; and the issue of training data in an industry context.

6.1. The ‘industry’ focus of the project

Very few studies explicitly discuss how the focuses of industry and research projects differ and how this affects the approach taken. Among those that do (Chiticariu et al., 2013; Suganthan et al., 2015; Krishna et al., 2016), it is widely acknowledged that several factors typically render a research-focused approach impractical: the lack of training data and the associated cost of creating it, the fast development cycle, the need for interpretability of machine learning predictions, and the continuous updating of the model as business needs and knowledge evolve. Industry projects focus on ‘problem solving’ in real-world contexts under (often harsh) time and resource constraints, while research projects focus on ‘creative solutions’ to problems often studied in an ideal ‘lab’ environment that may not fully represent reality.

Our experience supports most, if not all, of the viewpoints above. The industry project has a clear focus: developing a ‘proof-of-concept’ product within a time limit of one year and within a financial budget. The business partner needs to be able to take the finished product forward for further development and/or extension beyond the project. Analysis at the beginning of the project showed its ‘multi-task’ and ‘heterogeneous data’ nature.

Combining all these factors, a typical research-driven approach is infeasible, that is, one that aims to develop ‘novel’ solutions built on insights from rigorous critical evaluation of multiple baselines and the state of the art, using research ‘benchmarks’ that may deviate from the real data. Instead, one lesson we learned is to opt for ‘tried and tested’ methods that present low ‘barriers’ to users, who may need to take the solution forward for future development. This is a fundamental principle that needs to be taken into account when evaluating other aspects of the project.

Authors:
(1) Ziqi Zhang*, Information School, the University of Sheffield, Regent Court, Sheffield, UK S1 4DP (Ziqi.Zhang@sheffield.ac.uk);
(2) Tomas Jasaitis, Vamstar Ltd., London (Tomas.Jasaitis@vamstar.io);
(3) Richard Freeman, Vamstar Ltd., London (Richard.Freeman@vamstar.io);
(4) Rowida Alfrjani, Information School, the University of Sheffield, Regent Court, Sheffield, UK S1 4DP (Rowida.Alfrjani@sheffield.ac.uk);
(5) Adam Funk, Information School, the University of Sheffield, Regent Court, Sheffield, UK S1 4DP (Adam.Funk@sheffield.ac.uk).

This paper is available on arxiv under CC BY 4.0 license.

