Overcoming Multilingual and Multi-Task Challenges in NLP

by Text Mining, December 26th, 2024

Too Long; Didn't Read

To address the complexity of multilingual and heterogeneous data, the project developed lightweight NLP methods and generalized solutions. By using text classification with adaptable domain lexicons, the approach simplifies updates and re-training while being efficient and scalable for complex tasks.
  1. Abstract and Introduction

  2. Domain and Task

    2.1. Data sources and complexity

    2.2. Task definition

  3. Related Work

    3.1. Text mining and NLP research overview

    3.2. Text mining and NLP in industry use

    3.3. Text mining and NLP for procurement

    3.4. Conclusion from literature review

  4. Proposed Methodology

    4.1. Domain knowledge

    4.2. Content extraction

    4.3. Lot zoning

    4.4. Lot item detection

    4.5. Lot parsing

    4.6. XML parsing, data joining, and risk indices development

  5. Experiment and Demonstration

    5.1. Component evaluation

    5.2. System demonstration

  6. Discussion

    6.1. The ‘industry’ focus of the project

    6.2. Data heterogeneity, multilingual and multi-task nature

    6.3. The dilemma of algorithmic choices

    6.4. The cost of training data

  7. Conclusion, Acknowledgements, and References

6.2. Data heterogeneity, multilingual and multi-task nature

As noted throughout, the data we dealt with differ from those in research and in other industry text mining and NLP practices reported in the literature: they are multilingual and heterogeneous, so the goal of the project cannot be achieved by a single type of method but instead draws on multiple subfields of text mining and NLP research. While there may be well-established solutions in each of these subfields, as discussed before, the differences in the raw data analysed, the multilingual nature of the data and the lack of training data in our project make adopting these methods very difficult and time-consuming. Considering the wider principle mentioned above, another lesson we learned is that, given such a high level of complexity in our tasks, it is imperative to develop ‘lightweight’ methods that are easy to maintain, or ‘generalised’ solutions that could apply to multiple tasks, or both.


On reflection, we opted to treat many tasks as text classification, using a feature representation scheme that is arguably language independent and that applies to, or can be easily adapted to, all of the tasks. Specifically, two components of our pipeline (lot zoning and lot item detection) deal with text classification at different levels of granularity, and therefore the machine learning algorithms can be easily reused across them. The feature representations are derived with reference to domain lexicons, which can be translated into multiple languages relatively easily and affordably. This matters given the massive number of multilingual documents we have to analyse and the fact that only a fraction of them contain content that is really useful for analysis. A further benefit of our approach is that it enables domain experts and system administrators to update the trained models by simply revising the domain lexicons during re-training. To the best of our knowledge, no prior literature describes in detail a holistic text mining or NLP process composed of multiple sub-tasks, or develops generalisable solutions like ours. Our work therefore provides an important reference for future industry projects dealing with complex tasks.
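
To make the idea of lexicon-derived, language-independent features more concrete, the following is a minimal sketch and not the implementation used in the project. It assumes that each lexicon category contributes one feature slot, that each language supplies its own terms for that category, and that a simple nearest-centroid classifier stands in for whatever learning algorithm is actually used. The names `LEXICONS`, `featurize` and `NearestCentroid`, as well as all categories, terms and example texts, are hypothetical.

```python
import re
from collections import Counter

# Hypothetical domain lexicons: category -> language -> terms.
# Categories and terms are illustrative only, not taken from the paper.
LEXICONS = {
    "lot": {"en": {"lot", "lots"}, "de": {"los", "lose"}},
    "quantity": {"en": {"quantity", "units"}, "de": {"menge", "stück"}},
    "product": {"en": {"syringe", "gloves"}, "de": {"spritze", "handschuhe"}},
}
CATEGORIES = sorted(LEXICONS)  # fixed feature order shared by all languages


def tokenize(text):
    """Lowercase the text and split it on non-word characters."""
    return re.findall(r"\w+", text.lower())


def featurize(text, lang):
    """Map a text unit to one lexicon hit rate per category.

    Each slot means the same thing whichever language's terms produced
    the hits, which is what lets a model be shared across languages.
    """
    tokens = tokenize(text)
    counts = Counter()
    for token in tokens:
        for category in CATEGORIES:
            if token in LEXICONS[category].get(lang, set()):
                counts[category] += 1
    total = max(len(tokens), 1)  # avoid division by zero on empty text
    return [counts[c] / total for c in CATEGORIES]


class NearestCentroid:
    """Tiny stand-in classifier: predict the label of the closest class mean."""

    def fit(self, vectors, labels):
        sums, sizes = {}, Counter(labels)
        for vec, label in zip(vectors, labels):
            acc = sums.setdefault(label, [0.0] * len(vec))
            for i, value in enumerate(vec):
                acc[i] += value
        self.centroids = {
            label: [s / sizes[label] for s in acc] for label, acc in sums.items()
        }
        return self

    def predict(self, vec):
        def squared_distance(centroid):
            return sum((a - b) ** 2 for a, b in zip(vec, centroid))

        return min(self.centroids, key=lambda l: squared_distance(self.centroids[l]))


# Train on English examples only (made-up sentences).
train = [
    ("Lot 1: 500 units of syringe", "lot_item"),
    ("Tenders must be submitted before the deadline", "other"),
]
model = NearestCentroid().fit(
    [featurize(text, "en") for text, _ in train],
    [label for _, label in train],
)

# Apply the same model to German text via the German lexicon entries.
pred_item = model.predict(featurize("Los 2: Menge 200 Stück Handschuhe", "de"))
pred_other = model.predict(featurize("Angebote sind schriftlich einzureichen", "de"))
print(pred_item, pred_other)
```

Under these assumptions, the same `featurize` function can be applied to text units of any size, which is one way to read the reuse across classification tasks of different granularity. Supporting a new language or revising domain knowledge amounts to editing `LEXICONS` and re-running `fit`, without changing the feature layout or the classifier code.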


Authors:

(1) Ziqi Zhang*, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP ([email protected]);

(2) Tomas Jasaitis, Vamstar Ltd., London ([email protected]);

(3) Richard Freeman, Vamstar Ltd., London ([email protected]);

(4) Rowida Alfrjani, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP ([email protected]);

(5) Adam Funk, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP ([email protected]).


This paper is available on arXiv under a CC BY 4.0 license.