

Why NLP Projects in Business Aren’t the Same as Research


Table of Links

  1. Abstract and Introduction

  2. Domain and Task

    2.1. Data sources and complexity

    2.2. Task definition

  3. Related Work

    3.1. Text mining and NLP research overview

    3.2. Text mining and NLP in industry use

    3.3. Text mining and NLP for procurement

    3.4. Conclusion from literature review

  4. Proposed Methodology

    4.1. Domain knowledge

    4.2. Content extraction

    4.3. Lot zoning

    4.4. Lot item detection

    4.5. Lot parsing

    4.6. XML parsing, data joining, and risk indices development

  5. Experiment and Demonstration

    5.1. Component evaluation

    5.2. System demonstration

  6. Discussion

    6.1. The ‘industry’ focus of the project

    6.2. Data heterogeneity, multilingual and multi-task nature

    6.3. The dilemma of algorithmic choices

    6.4. The cost of training data

  7. Conclusion, Acknowledgements, and References

6. Discussion

In this section, we discuss lessons learned from the project that may inform future research and practice. We cover these from several angles: the different focus of industry projects compared to research, the complexity of building a full NLP pipeline for heterogeneous data and its implications for development, the dilemma of choosing between more advanced NLP methods and earlier classic methods, and the issue of training data in an industry context.

6.1. The ‘industry’ focus of the project

Very few studies explicitly discuss how the focus of industry projects differs from that of research projects, and how this could affect the approach taken. Among those that do (Chiticariu et al., 2013; Suganthan et al., 2015; Krishna et al., 2016), it is widely acknowledged that several factors typically render a research-focused approach impractical: the lack of training data and the associated cost of creating it, the fast development cycle, the need for interpretability of machine learning predictions, and the continuous updating of the model as business needs and knowledge evolve. Industry projects focus on ‘problem solving’ in real-world contexts with (often harsh) time and resource constraints, while research projects focus on ‘creative solutions’ to problems often studied in an ideal ‘lab’ environment that may not fully represent reality.


Our experience has supported most, if not all, of the viewpoints above. The industry project has a clear focus: developing a ‘proof-of-concept’ product within a one-year time limit and a financial budget. The business partner needs to be able to take the finished product forward for further development and/or extension beyond the project. Analysis at the beginning of the project revealed its ‘multi-task’ and ‘heterogeneous data’ nature. Given all these factors, a typical research-driven approach is infeasible: one that aims to develop ‘novel’ solutions built on insights from rigorous critical evaluation of multiple baselines and state-of-the-art methods, using research ‘benchmarks’ that may deviate from the real data. Instead, one lesson we learned is to opt for ‘tried and tested’ methods that present low ‘barriers’ to users, who may need to take the solution forward for future development. This is a fundamental principle that needs to be taken into account when evaluating other aspects of the project.


Authors:

(1) Ziqi Zhang*, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP;

(2) Tomas Jasaitis, Vamstar Ltd., London;

(3) Richard Freeman, Vamstar Ltd., London;

(4) Rowida Alfrjani, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP;

(5) Adam Funk, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP.


This paper is available on arXiv under a CC BY 4.0 license.