Listen to this story
Senior Software Architect / Engineering Lead at Behavioral Signals
Sooner or later every software engineer will come across a situation that requires some kind of text parsing. The text might contain information in semi-structured form that need to get parsed and saved into data structures. Different methods can be used to tackle the text parsing problem. One approach is to develop custom code using a line split approach and then parse the separate elements.
Another one can be to use a built-in scanner library function available in many programming languages. If you are interested to dive deep into the topic you can find a very detailed guide here, covering everything from regular expressions, grammars and lexers to parsing trees and algorithms.
In this article, I compare two parsing methods with a focus on their performance. The first method uses regular expressions for parsing and data extraction. You can find a regular expression engine in any programming language. Great examples of regular expression library capabilities are included in perl, python, php and java.
A regular expression can be modeled by a Finite State Machine (FSM). FSMs are quite flexible and can be used for any task that has distinct states of processing or operation. In parsing text using regular expressions, a class of FSMs is of particular interest, the Non-deterministic FSM or as they are also known the Non-deterministic Finite Automata (NDFA), You can easily convert a regular expression into a NDFA and then use the NDFA to parse the text input.
Using an NDFA in practice usually means a lot of backtracking. The algorithm has to follow every possible transition from a state to another and when a possible path fails it goes back and continues with another path. In theory, there are methods available that convert a NDFA to a Deterministic Finite Automaton (DFA). This conversion can be quite tricky when your parser is designed using regular expressions.
In this article, I opted for a DFA designed from scratch as the second method to parse the text and extract the data. This allowed me to compare the two methods, the NDFA and the DFA, in terms of performance.
To compare the two methods, the case study of an ini-like configuration file parser and reader was used. The input file can have configuration sections that are denoted in a single line using brackets and section names, for example:
[sectionA]
The section line can also have a comment part:
[sectionA] ; this is a comment
Comments can also exist on their own in a line:
; this line contains only comment
Each section can contain multiple lines with settings in a key-value fashion. The key name must be a string and the value can take various types. The supported types are string, integer, path, array and boolean. Some examples are given below:
keyname1 = "A string value"
keyname2 = 35363262
keyname3 = /root/.ssh/file.conf
keyname4 = value1,value2,value3
keyname5 = true
Each setting can have overrides that are loaded instead of the main value when the user enters the flag during loading. For example, you can have multiple values for key name config_path depending on the deployment system production, staging, testing:
config_path<production> = /var/lib/config/prod.cfg
config_path<testing> = /var/lib/config/prod.cfg
config_path<staging> = /var/lib/config/prod.cfg
For a more comprehensive example of the accepted input, you can check the github repository containing the code used in this article. A sample input file can be found here.
Firstly, the regular expression parser was put in action and its performance measured. The graph detailing the NDFA that was created from the regular expression is shown in graph 1.
This graph implies backtracking when a path fails until all possible paths in the graph from Start to End are exhausted. In each one of the nodes, a regular expression match is tried and if it succeeds the corresponding data is extracted if applicable.
Comment node expects the following regular expression to match
\s*;.*
Section node has a data group and expects
\s*\[([\\w_]+)\]\s*
For a full definition of the regular expressions used in this NDFA you can check the class that implements it in the source code.
Next, a parser based on an FSM implementation was build and tested on the same input data. The FSM graph is shown in graph 2.
In this case, transitions between states are deterministic and depend only on the current input character. There is no backtracking and each text line is processed character by character using the FSM. States that are tagged with the isFinal flag are valid terminal states.
If the input text is fully processed while the FSM is in a final state, the parsing is completed successfully. The color legend shows the states extract different data types from the input text. For example, light-green state 3 extracts the group section name.
For a full definition of the FSM and its implementation in Java you can check the source code here.
So what are the results of the performance comparison between the two methods? To find out, a set of text parsing operations on a large synthetic file was done. The parsing was repeated multiple times for each method and the average time was measured. The resulting data from text parsing was put onto memory in HashMap. You can see the results of the comparison for different file sizes in graph 3.
The blue line shows the parsing time of NDFA for different sizes of the input file. The green line shows the parsing time of DFA for the same input size. The input size is counted in total configuration lines and the x axis scale is logarithmic. The DFA scales well and approximates a O(logN) complexity, whereas NDFA approximates a O(N) complexity. At N = 2.4M lines DFA is 4-5x faster than NDFA. This makes DFA parser an excellent parser candidate when text parsing speed is important.
In this article, I have reviewed and compared the performance of different types of Finite Automata for the task of text parsing into a data structure in memory. A Non-Deterministic Finite Automaton based on Java's Regular Expression engine was built and tested.
Next, a Deterministic Finite Automaton was built using a custom-built Java code FSM implementation. The DFA was found at least 4 times faster than the NDFA and having logarithmic time complexity as the input size increases. You can find the full code implementation used in the article in my Github repository.
About the author
Spiros Dimopoulos currently works as a Senior Software Architect / Technical Lead at Behavioral Signals. He likes to design and implement large-scale distributed application backends and single-page web applications, embedded with Machine Learning (ML) components. From time to time he might also get his hands dirty with Infrastructure as Code (IaC) and DevOps tasks or mini IoT projects.