Why Supervised Methods in NLP Struggle in Real-World Applications

Table of Links

  1. Abstract and Introduction

  2. Domain and Task

    2.1. Data sources and complexity

    2.2. Task definition

  3. Related Work

    3.1. Text mining and NLP research overview

    3.2. Text mining and NLP in industry use

    3.3. Text mining and NLP for procurement

    3.4. Conclusion from literature review

  4. Proposed Methodology

    4.1. Domain knowledge

    4.2. Content extraction

    4.3. Lot zoning

    4.4. Lot item detection

    4.5. Lot parsing

    4.6. XML parsing, data joining, and risk indices development

  5. Experiment and Demonstration

    5.1. Component evaluation

    5.2. System demonstration

  6. Discussion

    6.1. The ‘industry’ focus of the project

    6.2. Data heterogeneity, multilingual and multi-task nature

    6.3. The dilemma of algorithmic choices

    6.4. The cost of training data

  7. Conclusion, Acknowledgements, and References

3.4. Conclusion from literature review

Our literature review shows that, despite significant research in text mining and NLP, the field is dominated by supervised methods built on well-curated data, and these methods do not transfer well to practical scenarios. This is partly reflected in the number of industrial text mining/NLP studies that incorporate rule-based methods and domain lexicons, except in a few areas (e.g., the legal domain) where high-quality curated resources are abundant. Most industrial studies also address a single, sometimes simplified, task and do not report a full end-to-end process; in particular, they give few details on how their methods deal with data heterogeneity and inconsistency. Further, no prior work has focused on the healthcare domain. Our work will address these gaps.


Authors:

(1) Ziqi Zhang*, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP ([email protected]);

(2) Tomas Jasaitis, Vamstar Ltd., London ([email protected]);

(3) Richard Freeman, Vamstar Ltd., London ([email protected]);

(4) Rowida Alfrjani, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP ([email protected]);

(5) Adam Funk, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP ([email protected]).


This paper is available on arXiv under a CC BY 4.0 license.