Domain and Task
Related Work
3.1. Text mining and NLP research overview
3.2. Text mining and NLP in industry use
4.6. XML parsing, data joining, and risk indices development
Experiment and Demonstration
Discussion
6.1. The ‘industry’ focus of the project
6.2. Data heterogeneity, multilingual and multi-task nature
Text mining and NLP are already widely used in a number of industry contexts. Due to limited space, we look only briefly at a few domains, and we focus on work for real applications rather than general-purpose tasks such as building domain corpora or language models, or fundamental NLP research adapted to domain-specific data.
In the legal domain, Zhong et al. (2020) summarised three main application areas: judicial decision prediction, similar case matching and question answering (QA). Judicial decision prediction studies the problem of determining the verdict of a court from textual information about a case that was available before the verdict was made. This is typically treated as a text classification task, where court decisions are categorised and predicted from features extracted from the case texts. For details, we refer readers to a survey by Francia et al. (2022). Similar case matching is an important application in settings where judicial decisions are made according to similar and representative cases in the past. The goal is to find pairs of similar cases. This is often cast as a retrieval task, and is being extensively studied in Legal IR (Xiao et al., 2019). As in judicial decision prediction, existing methods mainly focus on comparing case texts at the word or semantic level. In both areas, recent studies have argued for extracting more fine-grained ‘case element’ information (e.g., by using NER) that may better represent legal cases. For example, Hu et al. (2018) extracted ‘legal attributes’ that define the nature of cases in judicial decision prediction. Question answering is a rather complicated NLP task that builds on many low-level tasks, such as finding relevant text passages (PR and text classification), locating case elements (NER and relation extraction), and text similarity matching. QA is also not related to our work. For both reasons, we do not go into detail about legal QA, other than to note that questions typically concern the explanation of legal concepts or case analysis.
Legal documents are often lengthy, but they are well structured and follow standard layouts that make them easier to process. Over the years, the research community in this domain has defined granular tasks and standards, and has created rich resources including training data. In contrast, our work deals with much more heterogeneous data and inconsistent structures, where creating training data for some tasks is expensive. Most established methods in this domain are therefore not directly transferable to our task.
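To make the word-level retrieval framing of similar case matching concrete, the following is a minimal sketch that ranks candidate cases by cosine similarity over bag-of-words counts. It is not the method of any study cited above; the function names and the toy case summaries are hypothetical and invented purely for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_similar_cases(query, cases):
    """Return (case_id, score) pairs sorted by decreasing similarity to the query."""
    query_vec = Counter(tokenize(query))
    scored = [
        (case_id, cosine(query_vec, Counter(tokenize(text))))
        for case_id, text in cases.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy case summaries, invented for illustration only.
CASES = {
    "case_a": "The defendant stole a vehicle and was charged with theft.",
    "case_b": "The tenant failed to pay rent and the landlord sought eviction.",
    "case_c": "The defendant was charged with theft of a bicycle.",
}

ranking = rank_similar_cases("defendant charged with theft of a car", CASES)
for case_id, score in ranking:
    print(case_id, round(score, 3))
```

A purely lexical comparison of this kind cannot tell that "car" and "vehicle" refer to the same thing, which illustrates why semantic-level comparison and finer-grained case elements are of interest.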
With the growth of e-commerce and social media, mining Web content related to service/product provision has been extensively studied and applied to develop competitive advantage for businesses. A major area of application is analysing reviews from e-commerce or social media platforms to acquire business intelligence that informs various practices. For example, by analysing customer reviews about their own services or products, businesses can gain insights into customer satisfaction to inform public relations management (Nave et al., 2018) and product development (Yang et al., 2019). Analysing such data about a whole sector, including multiple competitors, allows the discovery of market trends and even the forecasting of demand (Chatterjee, 2019; Sharma et al., 2020). Further, as platforms for user-generated content multiply, fake information spreads more easily and is becoming an increasing concern. Therefore, work has been done to automatically detect and filter such misinformation (Fang et al., 2020). A thorough review of work in the above areas can be found in Kuma et al. (2021).
Work in these areas relies heavily on sentiment analysis, a type of text classification task that aims to determine the sentiment of a text. It is also very useful to extract key elements related to service/product provision (e.g., product features, service processes) for more fine-grained analysis, and this can benefit from NER methods. Many studies also use topic modelling and social network analysis, which are beyond the scope of this work. Compared to the domain in our work, the data studied in these areas are rather homogeneous: they are typically free-form review texts that are independent and self-contained, making it easy to adapt state-of-the-art methods. Partially for this reason, there is an abundance of tools developed in this area. However, as explained before, the data we have to deal with are much more complex.
Text mining and NLP are also widely used in system requirements analysis and in quality control in construction. To name a few examples, Li et al. (2015) processed software user manuals (PDFs) to extract functional and non-functional requirements. A dictionary of keywords combined with rules is used to extract sentences, and topic modelling is applied to cluster similar sentences, where each cluster may represent the same or similar requirements. The work also needed to filter irrelevant passages from the texts extracted from PDFs, but this was done manually. Our work is similar in that some of our methods make use of keywords and rules, and also process PDFs. However, we need to deal with a much larger and more heterogeneous collection, and the filtering of content must be done automatically. Tiun et al. (2020) developed supervised classifiers of functional and non-functional system requirements using well-curated training sentences obtained from Siemens Logistics and Automotive Organization. Khan et al. (2020) analysed Reddit posts about Google Maps, and trained a text classifier on a set of manually annotated posts to discover posts related to software feature requirements or issues. Both deal with higher-quality data and have access to a large amount of high-quality training data, while our project only managed to create limited training data for some, but not all, tasks due to resource constraints.
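As a rough illustration of how a keyword dictionary can be combined with rules to pull requirement sentences out of manual text, consider the sketch below. It is not a reconstruction of the pipeline in Li et al. (2015): the cue words, the two rules, and the sample manual text are all hypothetical assumptions chosen for brevity.

```python
import re

# Hypothetical keyword dictionaries; real ones would be curated by domain experts.
REQUIREMENT_CUES = {"shall", "must", "should"}
NON_FUNCTIONAL_CUES = {"secure", "reliable", "within", "seconds", "available"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_requirements(text):
    """Return (sentence, label) pairs for sentences that contain a requirement cue."""
    results = []
    for sentence in split_sentences(text):
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        # Rule 1 (assumed): a sentence without any cue word is treated as irrelevant.
        if not words & REQUIREMENT_CUES:
            continue
        # Rule 2 (assumed): any non-functional cue marks the sentence as non-functional.
        label = "non-functional" if words & NON_FUNCTIONAL_CUES else "functional"
        results.append((sentence, label))
    return results


# Hypothetical manual excerpt, invented for illustration only.
MANUAL = (
    "Welcome to the user manual. "
    "The system shall export reports as PDF files. "
    "Pages must load within two seconds. "
    "Contact support for help."
)

requirements = extract_requirements(MANUAL)
for sentence, label in requirements:
    print(label, "|", sentence)
```

Here Rule 1 doubles as a crude automatic filter for irrelevant passages. On a large and heterogeneous collection such a fixed cue list would likely miss many relevant sentences, so this should be read as a toy example only.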
Lee et al. (2014) applied keyword extraction and co-occurrence analysis to marine structure quality inspection reports. The co-occurrence map is used to help humans identify which quality aspects are most frequently mentioned and in what ways. Similar methods are used by Tian et al. (2021), who processed construction project reports to extract keywords and concepts related to several key aspects of project management (e.g., quality control, safety management). Zhang et al. (2019) developed a text classifier to classify accident reports from construction projects, and implemented a rule-based extractor to identify the causes of accidents. The data used in these studies are also rather simple compared to ours. For example, the reports used by Zhang et al. (2019) are composed of short paragraphs and use a consistent structure.
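The idea behind keyword co-occurrence analysis can be sketched as counting how often pairs of keywords appear in the same report. The example below is a simplified illustration under our own assumptions, not the procedure of Lee et al. (2014) or Tian et al. (2021): the keyword list and report texts are hypothetical, the keywords are given rather than extracted, and matching is exact with no stemming.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical keyword list; in practice it would come from a keyword extraction step.
KEYWORDS = {"weld", "crack", "corrosion", "coating"}


def cooccurrence_counts(reports, keywords):
    """Count, for each keyword pair, the number of reports mentioning both keywords."""
    counts = Counter()
    for report in reports:
        tokens = set(re.findall(r"[a-z]+", report.lower()))
        # Sort so that each unordered pair always has the same tuple representation.
        present = sorted(tokens & keywords)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


# Hypothetical inspection report snippets, invented for illustration only.
REPORTS = [
    "Crack found near the weld on panel 3.",
    "Coating damaged; corrosion visible under the coating.",
    "Weld inspection revealed a crack and minor corrosion.",
]

counts = cooccurrence_counts(REPORTS, KEYWORDS)
for pair, n in counts.most_common():
    print(pair, n)
```

The resulting pair counts can be read as weighted edges of a co-occurrence map, with the heaviest edges pointing to the quality aspects that are most often mentioned together.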
Authors:
(1) Ziqi Zhang*, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP;
(2) Tomas Jasaitis, Vamstar Ltd., London;
(3) Richard Freeman, Vamstar Ltd., London;
(4) Rowida Alfrjani, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP;
(5) Adam Funk, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP.
This paper is