
Why Supervised Methods in NLP Struggle in Real-World Applications

by Text Mining, December 23rd, 2024

Too Long; Didn't Read

Current text mining and NLP methods rely heavily on supervised approaches and well-curated data, which don't perform well in real-world applications, especially in healthcare.
  1. Abstract and Introduction

  2. Domain and Task

    2.1. Data sources and complexity

    2.2. Task definition

  3. Related Work

    3.1. Text mining and NLP research overview

    3.2. Text mining and NLP in industry use

    3.3. Text mining and NLP for procurement

    3.4. Conclusion from literature review

  4. Proposed Methodology

    4.1. Domain knowledge

    4.2. Content extraction

    4.3. Lot zoning

    4.4. Lot item detection

    4.5. Lot parsing

    4.6. XML parsing, data joining, and risk indices development

  5. Experiment and Demonstration

    5.1. Component evaluation

    5.2. System demonstration

  6. Discussion

    6.1. The ‘industry’ focus of the project

    6.2. Data heterogeneity, multilingual and multi-task nature

    6.3. The dilemma of algorithmic choices

    6.4. The cost of training data

  7. Conclusion, Acknowledgements, and References

3.4. Conclusion from literature review

Our literature review shows that, despite significant research in text mining and NLP, the field is strongly dominated by supervised methods built on well-curated data, and these methods do not transfer well to practical scenarios. This is partially reflected in the number of industrial text mining/NLP studies that incorporate rule-based methods and domain lexicons, except in a few areas (e.g., the legal domain) where high-quality curated resources are abundant. The majority of industrial studies also address single, and sometimes simplified, tasks and do not report a full end-to-end process; in particular, they give little detail on how their methods deal with data heterogeneity and inconsistency. Further, no prior work has focused on the healthcare domain. Our work will address these gaps.


Authors:

(1) Ziqi Zhang*, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP ([email protected]);

(2) Tomas Jasaitis, Vamstar Ltd., London ([email protected]);

(3) Richard Freeman, Vamstar Ltd., London ([email protected]);

(4) Rowida Alfrjani, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP ([email protected]);

(5) Adam Funk, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP ([email protected]).


This paper is available on arXiv under a CC BY 4.0 license.