How to Convert Different Data Formats into Universal JSON with VUD

Table of Links

Abstract and Introduction
Domain and Task

2.1. Data sources and complexity

2.2. Task definition
Related Work

3.1. Text mining and NLP research overview

3.2. Text mining and NLP in industry use

3.3. Text mining and NLP for procurement

3.4. Conclusion from literature review
Proposed Methodology

4.1. Domain knowledge

4.2. Content extraction

4.3. Lot zoning

4.4. Lot item detection

4.5. Lot parsing

4.6. XML parsing, data joining, and risk indices development
Experiment and Demonstration

5.1. Component evaluation

5.2. System demonstration
Discussion

6.1. The ‘industry’ focus of the project

6.2. Data heterogeneity, multilingual and multi-task nature

6.3. The dilemma of algorithmic choices

6.4. The cost of training data
Conclusion, Acknowledgements, and References

4.2. Content extraction

In this component, our goal is to convert heterogeneous data file formats (Word, Excel, PDF, etc.) into a universal, machine-accessible JSON format, which we refer to as VUD. For each file, depending on its format, we use the corresponding APIs (e.g., Apache Tika for Word and Excel files; Tesseract and PDFPlumber for PDF). However, not all APIs or data files support the extraction of formatting features (e.g., font size, colour, header level), especially if a document is not structurally tagged. Therefore, we extract only the text content and a limited range of structural information. Our VUD documents contain the following structured content:

● Pages, corresponding to the pages in the source document;

● Paragraphs, each identified as a consecutive block of text separated from its neighbours by at least two newline characters. For example, in Figure 1, ‘North Macedonia-Strumica: Pharmaceutical products’ and ‘2020/S 050-119757’ in the title will be treated as separate paragraphs, while ‘Legal Basis: Directive 2014/24/EU’ is a single paragraph;

● Tables, extracted using the Camelot and Tabula libraries, which retain the basic tabular structure: column and row indices (including column and row span information) and the textual content of each cell. Tables that span multiple pages are initially recognised as separate tables, but are then merged if they meet two simple rules: 1) there are no non-tabular structures between the two tables; 2) they have the same number of columns.
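
The paragraph and table rules above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the paper does not publish the VUD schema, so the flat `blocks` list, the key names (`type`, `page`, `text`, `rows`), the helper names, and the toy data are all assumptions made for illustration. The sketch also simplifies in two ways: it counts columns from the first row only, ignoring span information, and it applies the two merge rules to any adjacent tables in document order rather than checking that they sit on different pages.

```python
import json
import re


def split_paragraphs(text):
    """Split raw page text into paragraphs at two or more consecutive newlines."""
    parts = re.split(r"(?:\r?\n[ \t]*){2,}", text)
    return [p.strip() for p in parts if p.strip()]


def n_cols(table):
    """Column count of a table block, taken from its first row (spans ignored)."""
    return len(table["rows"][0]) if table["rows"] else 0


def merge_tables(blocks):
    """Merge adjacent table blocks that share the same number of columns.

    `blocks` is a document-order list of dicts with a "type" key
    ("paragraph" or "table"). Any non-table block between two tables
    prevents a merge (rule 1); differing column counts also do (rule 2).
    """
    merged = []
    for block in blocks:
        prev = merged[-1] if merged else None
        if (
            block["type"] == "table"
            and prev is not None
            and prev["type"] == "table"
            and n_cols(prev) == n_cols(block)
        ):
            prev["rows"].extend(block["rows"])
        elif block["type"] == "table":
            # Copy the row list so the input blocks are not mutated.
            merged.append({**block, "rows": list(block["rows"])})
        else:
            merged.append(dict(block))
    return merged


# Toy input (hypothetical): text from one page, plus a table split over two pages.
page_text = "North Macedonia-Strumica: Pharmaceutical products\n\n2020/S 050-119757"
blocks = [
    {"type": "paragraph", "page": 1, "text": p} for p in split_paragraphs(page_text)
]
blocks.append({"type": "table", "page": 1, "rows": [["Lot", "Item"], ["1", "Drug A"]]})
blocks.append({"type": "table", "page": 2, "rows": [["2", "Drug B"]]})

# Hypothetical VUD-like document: a source name plus the merged block list.
vud = {"source": "example.pdf", "blocks": merge_tables(blocks)}
print(json.dumps(vud, indent=2, ensure_ascii=False))
```

In this toy run the two title lines become separate paragraphs, and the two table fragments collapse into one three-row table because nothing sits between them and both have two columns. A production version would likely also need to compare span-aware column counts and drop repeated header rows, neither of which the source describes.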

Authors:

(1) Ziqi Zhang*, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP (Ziqi.Zhang@sheffield.ac.uk);

(2) Tomas Jasaitis, Vamstar Ltd., London (Tomas.Jasaitis@vamstar.io);

(3) Richard Freeman, Vamstar Ltd., London (Richard.Freeman@vamstar.io);

(4) Rowida Alfrjani, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP (Rowida.Alfrjani@sheffield.ac.uk);

(5) Adam Funk, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP (Adam.Funk@sheffield.ac.uk).

This paper is available on arXiv under a CC BY 4.0 license.
