Experiment and Demonstration
In this section, we present evaluations of some of the components described above and demonstrate part of the end system.
Here, we present evaluations of lot zoning and lot item detection, for which our domain experts were able to create gold standard data for model building and evaluation. Both tasks are cast as text classification, and we compare three machine learning algorithms: linear SVM, logistic regression (LR) and random forest (RF). We therefore use the standard evaluation metrics for classification: Precision, Recall and F1. For a given class, Precision = TP / (TP + FP), Recall = TP / (TP + FN) and F1 = 2 x Precision x Recall / (Precision + Recall), where TP, FP and FN are the numbers of true positives, false positives and false negatives for that class. Since each task has essentially two classes, we compute the scores for each class and report their average (macro-average).
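The paper does not include implementation details for these classifiers. The following is a minimal sketch of such a comparison, assuming scikit-learn with simple TF-IDF features; the toy texts and labels stand in for the annotated pages and tables and are purely illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support

# Toy examples standing in for annotated passages (1 = relevant, 0 = irrelevant).
train_texts = ["Lot 1: surgical gloves, nitrile, 10,000 units",
               "Lot 2: paracetamol 500mg tablets, 2,000 packs",
               "The contracting authority is the regional health board",
               "Deadline for receipt of tenders: 2021-06-30"]
train_labels = [1, 1, 0, 0]
test_texts = ["Lot 3: insulin pens, 500 packs",
              "Electronic submission via the national procurement portal"]
test_labels = [1, 0]

vectoriser = TfidfVectorizer(lowercase=True, ngram_range=(1, 2))
X_train = vectoriser.fit_transform(train_texts)
X_test = vectoriser.transform(test_texts)

models = {"linear SVM": LinearSVC(),
          "LR": LogisticRegression(max_iter=1000),
          "RF": RandomForestClassifier(n_estimators=100, random_state=0)}
for name, model in models.items():
    model.fit(X_train, train_labels)
    predictions = model.predict(X_test)
    # Macro-average: compute each metric per class, then take the unweighted mean.
    p, r, f1, _ = precision_recall_fscore_support(
        test_labels, predictions, average="macro", zero_division=0)
    print(f"{name}: P={p:.2f} R={r:.2f} F1={f1:.2f}")
```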
To evaluate lot zoning at the page and table levels, we created two separate datasets. We asked domain experts from Vamstar to annotate a random collection of pages and tables extracted from our raw TED datasets as ‘relevant’ or ‘irrelevant’. This gave us a total of 781 pages and 515 tables (multilingual in both cases). To evaluate lot item detection, we conducted two sets of experiments. First, using the TED database created by Vamstar, we identify the field containing values most similar to descriptions of lots and items, i.e., ‘lot and item descriptions’, split the values on sentence boundaries and take the unique values. Values with fewer than 2 words or more than 20 words are excluded. This gives us a total of 18,363 positive examples (multilingual). For negative examples, we apply the same process to the other four fields: ‘name and addresses’ of buyers, ‘notice title’, ‘short description’, and ‘contract criteria’. This gives us a total of 15,632 negative examples. Combined, we refer to this dataset as the ‘lot item gold standard’. However, it only allows evaluating sentence-level classifiers, and it does not necessarily represent the real documents that may describe lots in free text and tables. Therefore, our second evaluation involves domain experts manually verifying the extractions (sentences, table rows) of the best model from 20 English tender notices, for precision only. We refer to this dataset as ‘lot item verification’. Statistics of these datasets are shown in Table 1. Tables 2 and 3 show the evaluation results for the respective tasks.
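As an illustration of the gold-standard construction described above, the sketch below shows the sentence splitting, deduplication and length filtering. The field names, record structure and the naive regex-based sentence splitter are assumptions for illustration only; the paper does not specify the actual TED schema or the splitting tool used.

```python
import re

POSITIVE_FIELD = "lot_and_item_descriptions"                      # assumed column name
NEGATIVE_FIELDS = ["buyer_name_and_addresses", "notice_title",
                   "short_description", "contract_criteria"]      # assumed column names

def split_sentences(text):
    """Naive sentence splitter on terminal punctuation followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def build_examples(records, field):
    """Collect unique, length-filtered sentences for one field across all records."""
    sentences = set()
    for record in records:
        for sentence in split_sentences(record.get(field, "")):
            # Keep sentences with 2 to 20 words, as described above.
            if 2 <= len(sentence.split()) <= 20:
                sentences.add(sentence.strip())
    return sorted(sentences)

# Usage (ted_records would be the rows of the TED database built by Vamstar):
# positives = build_examples(ted_records, POSITIVE_FIELD)
# negatives = [s for f in NEGATIVE_FIELDS for s in build_examples(ted_records, f)]
```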
The results show that random forest is arguably the most robust algorithm, as it performed best on page-level lot zoning and lot item detection while achieving comparable results on table-level lot zoning. We also conducted an error analysis to understand where the automated methods failed.
In terms of the two ‘zoning’ tasks, we found three main causes of error that applied to both pages and tables. First, low recall is typically due to ‘sparse’ signals, such as a tender containing only one lot with very few items, described in a very short table or paragraph/list. Such content is difficult to extract because the features generated for those text passages are sparse compared to ‘denser’ pages or larger tables. Second, low precision is primarily due to ‘noise’ in the domain lexicon, as some generic words (e.g. ‘solution’, ‘wrap’, ‘free’ as in ‘sugar free’) were retained and scored highly. To address these issues, domain experts suggested developing ‘post-processing’ rules to ‘recover’ passages that are potentially valid candidates for processing, for example, enforcing that at least one text passage must be extracted from each tender (so that passages rejected by the classifier may be recovered); a sketch of such a rule is shown below. Excluding unigrams from the domain lexicon could also help address false positives. The third common cause of errors is content extraction from PDFs. For example, arbitrary whitespace is inserted between every letter when parsing some PDFs (e.g., ‘book’ becomes ‘b o o k’). This issue is much harder to address, as it depends on the quality of the PDFs as well as the third-party extraction tool. It exemplifies the practical challenges that industrial text mining and NLP projects face but research projects often do not, as the latter typically deal with much higher-quality data. This implies that, in reality, system performance may be limited by obstacles that are hard to overcome.
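As an illustration of the first suggested rule, the following sketch recovers the highest-scoring passage when the classifier rejects every passage of a tender, so that at least one candidate is always extracted. The threshold and the (text, probability) input format are hypothetical, not the project's actual configuration.

```python
def select_passages(scored_passages, threshold=0.5):
    """Keep passages above the threshold; recover the best one if none qualify."""
    selected = [(text, p) for text, p in scored_passages if p >= threshold]
    if not selected and scored_passages:
        # Recovery rule: nothing cleared the threshold, so take the best-scoring passage.
        selected = [max(scored_passages, key=lambda item: item[1])]
    return [text for text, _ in selected]

# Example: a tender whose only lot is described very briefly (a 'sparse' signal).
tender = [("Lot 1: insulin pens, 500 packs", 0.41),
          ("Award criteria: price 60%, quality 40%", 0.12)]
print(select_passages(tender))  # -> ['Lot 1: insulin pens, 500 packs']
```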
In terms of the lot item detection task, when examining the ‘lot item verification’ data, we find that the most common issue reducing precision is ‘noisy’ domain lexicon entries, as explained above. Further, the complexity of the domain vocabulary also makes the task challenging. For example, pharmaceutical ingredients sometimes have long names written in particular patterns, such as ‘17-alphahydroxyprogesterone’. For cases like these, domain experts suggested developing regular expressions that incorporate domain dictionaries. This implies that industrial text mining and NLP projects are often more complicated than tweaking a machine learning-based method for performance measured by established standards. Practical solutions often rely on incorporating domain knowledge in various forms, oftentimes rules. This is consistent with earlier findings such as Chiticariu et al. (2013).
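As an illustration of this suggestion, the sketch below builds a regular expression from a small, hypothetical dictionary of ingredient stems, allowing numeric and Greek-letter prefixes such as ‘17-alpha’. The real domain lexicon and the patterns used in the project are not published in the paper.

```python
import re

INGREDIENT_STEMS = ["hydroxyprogesterone", "paracetamol", "amoxicillin"]  # assumed entries

# Optional locant prefix such as '17-' and optional Greek-letter prefix,
# followed by any stem from the (hypothetical) domain dictionary.
PATTERN = re.compile(
    r"\b(?:\d+[-,]?)?(?:alpha|beta|gamma)?-?(?:" + "|".join(INGREDIENT_STEMS) + r")\b",
    re.IGNORECASE,
)

text = "Supply of 17-alphahydroxyprogesterone caproate injections"
print(PATTERN.findall(text))  # -> ['17-alphahydroxyprogesterone']
```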
Authors:
(1) Ziqi Zhang*, Information School, the University of Sheffield, Regent Court, Sheffield, UK S1 4DP;
(2) Tomas Jasaitis, Vamstar Ltd., London;
(3) Richard Freeman, Vamstar Ltd., London;
(4) Rowida Alfrjani, Information School, the University of Sheffield, Regent Court, Sheffield, UK S1 4DP;
(5) Adam Funk, Information School, the University of Sheffield, Regent Court, Sheffield, UK S1 4DP.