Domain and Task
Related Work
3.1. Text mining and NLP research overview
3.2. Text mining and NLP in industry use
4.6. XML parsing, data joining, and risk indices development
Experiment and Demonstration
Discussion
6.1. The ‘industry’ focus of the project
6.2. Data heterogeneity, multilingual and multi-task nature
As noted throughout, compared with research and with other industry text mining and NLP practices reported in the literature, the data we dealt with are multilingual and heterogeneous, such that the goal of the project cannot be achieved by a single type of method but draws on multiple subfields of text mining and NLP research. While there may be well-established solutions in each of these subfields, as discussed before, the differences in the raw data analysed, the multilingual nature of the data and the lack of training data in our project make adopting these methods very difficult and time-consuming. Considering the wider principle mentioned above, another lesson we learned is that, given such a high level of complexity in our tasks, it is imperative to develop ‘lightweight’ methods that are easy to maintain, or ‘generalised’ solutions that could apply to multiple tasks, or both.
On reflection, we opted to treat many tasks as text classification, using a feature representation scheme that is arguably language independent and that applies to, or can be easily adapted to, all of the tasks. Specifically, two components of our pipeline (lot zoning and lot item detection) deal with text classification at different granularities, and therefore the machine learning algorithms can be easily reused across them. The feature representations are derived with reference to domain lexicons, which can be more easily and affordably translated into multiple languages, particularly considering the massive number of multilingual documents we have to analyse and the fact that only a fraction of them contain really useful content for analysis. Another benefit of our approach is that it enables domain experts and system administrators to update the trained models by simply revising the domain lexicons during re-training. To the best of our knowledge, there is no prior literature that describes in detail a holistic text mining or NLP process composed of multiple sub-tasks, or that develops generalisable solutions like ours. Therefore, our work sets an important reference for future industry projects dealing with complex tasks.
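To make the idea of a lexicon-derived, language-independent feature representation concrete, the following is a minimal sketch and not the implementation used in the project. The lexicon entries, the concept names and the `featurize` helper are all hypothetical and chosen only for illustration. The sketch assumes that each lexicon concept is translated into every supported language, so that text in any language maps to the same fixed-length vector of concept counts.

```python
import re
from collections import Counter

# Hypothetical domain lexicon: concept -> surface terms per language.
# These entries are illustrative assumptions, not the project's lexicons.
LEXICON = {
    "quantity": {"en": ["quantity", "units"], "de": ["menge", "stück"]},
    "price": {"en": ["price", "value"], "de": ["preis", "wert"]},
    "product": {"en": ["syringe", "tablet"], "de": ["spritze", "tablette"]},
}
CONCEPTS = sorted(LEXICON)  # fixed feature order: price, product, quantity


def featurize(text, lang):
    """Return one count per lexicon concept, independent of the input language."""
    tokens = Counter(re.findall(r"\w+", text.lower()))
    return [
        sum(tokens[term] for term in LEXICON[concept].get(lang, []))
        for concept in CONCEPTS
    ]


# The same function serves both granularities: a whole passage (as in lot
# zoning) or a single line or table row (as in lot item detection).
english_row = featurize("Quantity: 500 units of syringe", "en")
german_row = featurize("Menge: 500 Stück Spritze", "de")
print(english_row, german_row)
```

Under these assumptions, the English and German rows yield the same vector, so a classifier trained on such features does not need to be rebuilt for each language. Revising a lexicon entry changes the features directly, which is consistent with the point above that domain experts can update the models by editing the lexicons and re-training.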
Authors:
(1) Ziqi Zhang*, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP;
(2) Tomas Jasaitis, Vamstar Ltd., London;
(3) Richard Freeman, Vamstar Ltd., London;
(4) Rowida Alfrjani, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP;
(5) Adam Funk, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP.
This paper is