
The High Cost of Training Data in NLP Projects

by Text Mining, December 26th, 2024

Too Long; Didn't Read

The high cost of developing training data for supervised models, especially for granular tasks like lot item detection, makes rule-based methods more practical in industry. Training data costs, labor, and the need for specialized tools often outweigh the benefits of fully supervised approaches.
  1. Abstract and Introduction

  2. Domain and Task

    2.1. Data sources and complexity

    2.2. Task definition

  3. Related Work

    3.1. Text mining and NLP research overview

    3.2. Text mining and NLP in industry use

    3.3. Text mining and NLP for procurement

    3.4. Conclusion from literature review

  4. Proposed Methodology

    4.1. Domain knowledge

    4.2. Content extraction

    4.3. Lot zoning

    4.4. Lot item detection

    4.5. Lot parsing

    4.6. XML parsing, data joining, and risk indices development

  5. Experiment and Demonstration

    5.1. Component evaluation

    5.2. System demonstration

  6. Discussion

    6.1. The ‘industry’ focus of the project

    6.2. Data heterogeneity, multilingual and multi-task nature

    6.3. The dilemma of algorithmic choices

    6.4. The cost of training data

  7. Conclusion, Acknowledgements, and References

6.4. The cost of training data

In the project, we used a mixture of supervised and rule-based (unsupervised) methods. There are many practical reasons for this choice but, in contrast to the scientific literature, which is predominantly based on supervised machine learning methods (Suganthan et al., 2015), the main reason for not opting for a fully supervised approach is cost. This cost arises from several factors: the human labour, the time required, the number (and complexity) of tasks to be addressed, and the tools needed for developing the training data.


The project uses a process that involves multiple tasks of varying complexity. Developing training data for the two zoning tasks is relatively low-cost, as they do not require specialist tools to support the annotation process. In both tasks, annotators were given the extracted text passages either as raw text or as CSV (for tables only), together with the original document for reference. The annotation itself is also fairly easy, as it only requires skimming through these documents rather than extensive reading. In contrast, annotating lot items (lot item detection) and their specific attributes (lot parsing) are much finer-grained tasks that require examining content at sentence or phrase level. The data also needs to be transformed into formats that can be loaded into state-of-the-art annotation tools, which do not always support our data format. For example, the widely used ‘spacyNER’ annotation format requires parallel sequences of tokens and their labels for each sentence. Such data is very expensive to develop from a business perspective.
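To illustrate the kind of transformation involved, the following is a minimal sketch of turning character-offset annotations into parallel token and label sequences for one sentence. It is not the project's pipeline: the BIO tagging scheme, the naive regex tokeniser, the function name `to_token_label_sequences`, the example sentence, and the labels `QUANTITY` and `ITEM` are all assumptions made for illustration.

```python
import re


def to_token_label_sequences(text, spans):
    """Convert character-offset spans into parallel token and BIO label lists.

    `spans` is a list of (start, end, label) tuples, with `end` exclusive.
    The tokeniser is a naive regex split, assumed here for illustration only.
    """
    tokens, labels = [], []
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tok_start, tok_end = match.span()
        tag = "O"  # default: token is outside any annotated span
        for start, end, label in spans:
            if tok_start >= start and tok_end <= end:
                # The first token of a span gets B-, later tokens get I-
                tag = ("B-" if tok_start == start else "I-") + label
                break
        tokens.append(match.group())
        labels.append(tag)
    return tokens, labels


# Hypothetical sentence and annotations (not taken from the project's data)
sentence = "Lot 2: 500 boxes of nitrile gloves"
annotations = [(7, 10, "QUANTITY"), (20, 34, "ITEM")]

tokens, labels = to_token_label_sequences(sentence, annotations)
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

Even in this toy form, every sentence needs consistent tokenisation and exact span boundaries, which hints at why annotation at this granularity costs far more than skimming passages for zoning.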


Further, as already raised in earlier studies, industrial solutions often favour rules over supervised models, as the former offer easier interpretability and maintainability. These are crucial factors because the needs of an application can evolve over time, and the solution must therefore be easy to modify. In these cases, changing rules may be a much lighter effort than recreating training data and retraining the system. For these reasons, the training data required by supervised models can be a barrier to adopting methods developed in the research communities when it comes to developing industrial applications.
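The maintainability argument can be sketched with a toy rule-based detector. The patterns below are hypothetical and are not the rules used in the project; the names `LOT_ITEM_RULES` and `detect_lot_items`, the example lines, and the German keyword "Los" are assumptions chosen only to show that adapting a rule set can be a one-line change.

```python
import re

# Hypothetical rule: a lot item line starts with "Lot <number>" and a separator
LOT_ITEM_RULES = [
    re.compile(r"^\s*Lot\s+(?P<number>\d+)\s*[:.\-]\s*(?P<description>.+)$",
               re.IGNORECASE),
]


def detect_lot_items(lines, rules=LOT_ITEM_RULES):
    """Return (lot number, description) pairs for lines matching any rule."""
    items = []
    for line in lines:
        for rule in rules:
            m = rule.match(line)
            if m:
                items.append((int(m.group("number")),
                              m.group("description").strip()))
                break  # first matching rule wins
    return items


lines = [
    "Lot 1: Surgical masks",
    "Delivery within 30 days",
    "Los 2 - Nitrile gloves",
]

detected_before = detect_lot_items(lines)

# A new requirement (here, a German-language keyword) is handled by adding
# one pattern; no data is re-annotated and no model is retrained.
extended_rules = LOT_ITEM_RULES + [
    re.compile(r"^\s*Los\s+(?P<number>\d+)\s*[:.\-]\s*(?P<description>.+)$",
               re.IGNORECASE),
]
detected_after = detect_lot_items(lines, extended_rules)
```

A supervised model facing the same change would typically need newly labelled examples and a retraining cycle, which is the cost contrast described above.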


Authors:

(1) Ziqi Zhang*, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP ([email protected]);

(2) Tomas Jasaitis, Vamstar Ltd., London ([email protected]);

(3) Richard Freeman, Vamstar Ltd., London ([email protected]);

(4) Rowida Alfrjani, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP ([email protected]);

(5) Adam Funk, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP ([email protected]).


This paper is available on arXiv under a CC BY 4.0 license.