Domain and Task
Related Work
3.1. Text mining and NLP research overview
3.2. Text mining and NLP in industry use
4.6. XML parsing, data joining, and risk indices development
Experiment and Demonstration
Discussion
6.1. The ‘industry’ focus of the project
6.2. Data heterogeneity, multilingual and multi-task nature
In this component, our goal is to convert heterogeneous data file formats (Word, Excel, PDF, etc.) into a universal, machine-accessible JSON format, which we refer to as VUD. For each file, depending on its format, we use the corresponding APIs (e.g., Apache Tika for Word and Excel files, Tesseract and PDFPlumber for PDF). However, not all APIs or file formats support the extraction of formatting features (e.g., font size, colour, header level), especially if a document is not structurally tagged. Therefore, we extract only the text content and a limited range of structural information. Each VUD document contains the following structured content:
● Pages, corresponding to the pages in the source document;
● Paragraphs, identified as consecutive blocks of text separated by at least two newline characters. For example, in Figure 1, ‘North Macedonia-Strumica: Pharmaceutical products’ and ‘2020/S 050-119757’ in the title are treated as separate paragraphs, while ‘Legal Basis: Directive 2014/24/EU’ is a single paragraph;
● Tables, extracted using the Camelot and Tabula libraries, which capture basic tabular structure: column and row indices (with column and row span information) and the textual content of each cell. A table that spans multiple pages is initially recognised as several separate tables, which are then merged if they meet two simple rules: 1) no non-tabular structure lies between the two tables; 2) they have the same number of columns.
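The paragraph rule and the two table-merging rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the function names `split_paragraphs` and `merge_tables`, the element dictionaries, and the keys `type`, `pages`, `rows` and `text` are all hypothetical, and the column count is taken from the first row, which ignores column spans.

```python
import copy
import json
import re


def split_paragraphs(page_text):
    """Split page text into paragraphs on runs of two or more newlines."""
    text = page_text.replace("\r\n", "\n")  # normalise Windows line endings
    blocks = re.split(r"\n{2,}", text)
    return [block.strip() for block in blocks if block.strip()]


def column_count(table):
    """Number of columns, read from the first row (assumption: no column spans)."""
    return len(table["rows"][0]) if table["rows"] else 0


def merge_tables(elements):
    """Merge neighbouring tables in a document-ordered list of elements.

    Two tables are merged when (1) no non-tabular element lies between them
    and (2) they have the same number of columns. The input is not modified.
    """
    merged = []
    for element in elements:
        previous = merged[-1] if merged else None
        if (
            element["type"] == "table"
            and previous is not None
            and previous["type"] == "table"
            and column_count(previous) == column_count(element)
        ):
            # Append the rows of the continuation table to the previous one.
            previous["rows"].extend(copy.deepcopy(element["rows"]))
            for page in element["pages"]:
                if page not in previous["pages"]:
                    previous["pages"].append(page)
        else:
            merged.append(copy.deepcopy(element))
    return merged


# Hypothetical usage: two page fragments of one table, then a paragraph.
elements = [
    {"type": "table", "pages": [1], "rows": [["Item", "Qty"], ["Gloves", "100"]]},
    {"type": "table", "pages": [2], "rows": [["Masks", "250"]]},
    {"type": "paragraph", "pages": [2], "text": "Legal Basis: Directive 2014/24/EU"},
]
print(json.dumps(merge_tables(elements), indent=2))
```

Because the merge only looks at the immediately preceding element, any paragraph between two tables prevents merging, which mirrors rule 1. A stricter variant could also require the tables to sit on consecutive pages, but the rules as stated do not ask for that.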
Authors:
(1) Ziqi Zhang*, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP;
(2) Tomas Jasaitis, Vamstar Ltd., London;
(3) Richard Freeman, Vamstar Ltd., London;
(4) Rowida Alfrjani, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP;
(5) Adam Funk, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP.
This paper is